Abstract

We consider the problem of online linear regression on individual sequences. The goal in this paper is for the forecaster to output sequential predictions which are, after $T$ time rounds, almost as good as the ones output by the best linear predictor in a given $\ell^1$-ball in $\mathbb{R}^d$. We consider both the cases where the dimension $d$ is small and large relative to the time horizon $T$. We first present regret bounds with optimal dependencies on $d$, $T$, and on the sizes $U$, $X$ and $Y$ of the $\ell^1$-ball, the input data and the observations. The minimax regret is shown to exhibit a regime transition around the point $d = \sqrt{T} U X/(2Y)$. Furthermore, we present efficient algorithms that are adaptive, i.e., that do not require the knowledge of $U$, $X$, $Y$, and $T$, but still achieve nearly optimal regret bounds.
1. Introduction

In this paper, we consider the problem of online linear regression against arbitrary sequences of input data and observations, with the objective of being competitive with respect to the best linear predictor in an $\ell^1$-ball of arbitrary radius. This extends the task of convex aggregation. We consider both low- and high-dimensional input data. Indeed, in a large number of contemporary problems, the available data can be high-dimensional, i.e., the dimension of each data point is larger than the number of data points. Examples include analysis of DNA sequences, collaborative filtering, astronomical data analysis, and cross-country growth regression. In such high-dimensional problems, performing linear regression on an $\ell^1$-ball of small diameter may be helpful if the best linear predictor is sparse. Our goal is, in both low and high dimensions, to provide online linear regression algorithms along with bounds on $\ell^1$-balls that characterize their robustness to worst-case scenarios.

1.1. Setting 

We consider the online version of linear regression, which unfolds as follows. First, the environment chooses a sequence of observations $(y_t)_{t \geq 1}$ in $\mathbb{R}$ and a sequence of input vectors $(x_t)_{t \geq 1}$ in $\mathbb{R}^d$, both initially hidden from the forecaster. At each time instant $t \in \mathbb{N}^* = \{1, 2, \ldots\}$, the environment reveals the data $x_t \in \mathbb{R}^d$; the forecaster then gives a prediction $\hat{y}_t \in \mathbb{R}$; the environment in turn reveals the observation $y_t \in \mathbb{R}$; and finally, the forecaster incurs the square loss $(y_t - \hat{y}_t)^2$. The dimension $d$ can be either small or large relative to the number $T$ of time steps: we consider both cases.

In the sequel, $u \cdot v$ denotes the standard inner product between $u, v \in \mathbb{R}^d$, and we set $\|u\|_\infty = \max_{1 \leq j \leq d} |u_j|$ and $\|u\|_1 = \sum_{j=1}^d |u_j|$. The $\ell^1$-ball of radius $U > 0$ is the following bounded subset of $\mathbb{R}^d$:

\[ B_1(U) = \{ u \in \mathbb{R}^d : \|u\|_1 \leq U \} . \]
Given a fixed radius $U > 0$ and a time horizon $T \geq 1$, the goal of the forecaster is to predict almost as well as the best linear forecaster in the reference set $\{ x \in \mathbb{R}^d \mapsto u \cdot x \in \mathbb{R} : u \in B_1(U) \}$, i.e., to minimize the regret on $B_1(U)$ defined by

\[ \sum_{t=1}^T (y_t - \hat{y}_t)^2 - \min_{u \in B_1(U)} \left\{ \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\} . \]

We shall present algorithms along with bounds on their regret that hold uniformly over all sequences$^1$ $(x_t, y_t)_{1 \leq t \leq T}$ such that $\|x_t\|_\infty \leq X$ and $|y_t| \leq Y$ for all $t = 1, \ldots, T$, where $X, Y > 0$. These regret bounds depend on four important quantities: $U$, $X$, $Y$, and $T$, which may be known or unknown to the forecaster.
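
The protocol and the quantity being controlled can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`cumulative_square_loss`, `comparator_loss`, `null_forecaster`) and the synthetic data are hypothetical and not part of the paper, and the excess loss over one fixed comparator $u^\star \in B_1(U)$ is merely a lower bound on the regret, which involves the minimum over the whole ball.

```python
import numpy as np

def cumulative_square_loss(forecaster, xs, ys):
    """Run the online protocol and return the forecaster's cumulative square loss."""
    past = []                            # pairs (x_s, y_s) revealed so far
    total = 0.0
    for x_t, y_t in zip(xs, ys):
        y_hat = forecaster(x_t, past)    # prediction uses x_t and past data only
        total += (y_t - y_hat) ** 2      # y_t is revealed after the prediction
        past.append((x_t, y_t))
    return total

def comparator_loss(u, xs, ys):
    """Cumulative square loss of the fixed linear predictor x -> u . x."""
    return float(np.sum((ys - xs @ u) ** 2))

def null_forecaster(x_t, past):
    """Trivial forecaster that always predicts 0."""
    return 0.0

rng = np.random.default_rng(0)
T, d, U = 50, 5, 1.0
xs = rng.uniform(-1.0, 1.0, size=(T, d))   # so that ||x_t||_inf <= X = 1
u_star = np.zeros(d)
u_star[0] = U                               # a sparse point of B_1(U)
ys = xs @ u_star                            # so that |y_t| <= Y = 1

loss_null = cumulative_square_loss(null_forecaster, xs, ys)
excess = loss_null - comparator_loss(u_star, xs, ys)
print(f"excess loss of the null forecaster over u_star: {excess:.3f}")
```

Any forecaster with the same call signature can be plugged in place of `null_forecaster`.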

1.2. Contributions and related works 

In the next paragraphs we detail the main contributions of this paper in view of related works in online 
linear regression. 

Our first contribution (Section 2) consists of a minimax analysis of online linear regression on $\ell^1$-balls in the arbitrary sequence setting. We first provide a refined regret bound expressed in terms of $Y$, $d$, and a quantity $\kappa = \sqrt{T} U X/(2dY)$. This quantity $\kappa$ is used to distinguish two regimes: we show a distinctive regime transition$^2$ at $\kappa = 1$ or $d = \sqrt{T} U X/(2Y)$. Namely, for $\kappa < 1$, the regret is of the order of $\sqrt{T}$, whereas it is of the order of $\ln T$ for $\kappa > 1$.

The derivation of this regret bound partially relies on a Maurey-type argument used under various forms 
with i.i.d. data, e.g., in [1, 2, 3, 4] (see also bj). We adapt it in a straightforward way to the deterministic 
setting. Therefore, this is yet another technique that can be applied to both the stochastic and individual 
sequence settings. 

Unsurprisingly, the refined regret bound mentioned above matches the optimal risk bounds for stochastic settings$^3$ (see also Q). Hence, linear regression is just as hard in the stochastic setting as in the arbitrary sequence setting. Using the standard online to batch conversion, we make the latter statement more precise by establishing a lower bound for all $\kappa$ at least of the order of $\sqrt{\ln d}/d$. This lower bound extends those of [8], [9], which only hold for small $\kappa$ of the order of $1/d$.

The algorithm achieving our minimax regret bound is both computationally inefficient and non-adaptive (i.e., it requires prior knowledge of the quantities $U$, $X$, $Y$, and $T$ that may be unknown in practice). Those two issues were first overcome by [10] via an automatic tuning termed self-confident (since the forecaster somehow trusts itself in tuning its parameters). They indeed proved that the self-confident $p$-norm algorithm with $p = 2 \ln d$ and tuned with $U$ has a cumulative loss $\hat{L}_T = \sum_{t=1}^T (y_t - \hat{y}_t)^2$ bounded by

\[
\hat{L}_T \leq L_T^* + 8UX \sqrt{(e \ln d) L_T^*} + (32 e \ln d) U^2 X^2
\leq L_T^* + 8UXY \sqrt{e T \ln d} + (32 e \ln d) U^2 X^2 ,
\]

where $L_T^* = \min_{\{u \in \mathbb{R}^d : \|u\|_1 \leq U\}} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \leq T Y^2$. This algorithm is efficient, and our lower bound in terms of $\kappa$ shows that it is optimal up to logarithmic factors in the regime $\kappa \leq 1$ without prior knowledge of $X$, $Y$, and $T$.

Our second contribution (Section 3) is to show that similar adaptivity and efficiency properties can be obtained via exponential weighting. We consider a variant of the $\mathrm{EG}^\pm$ algorithm [9]. The latter has a manageable computational complexity and our lower bound shows that it is nearly optimal in the regime $\kappa \leq 1$. However, the $\mathrm{EG}^\pm$ algorithm requires prior knowledge of $U$, $X$, $Y$, and $T$. To overcome this adaptivity issue, we study a modification of the $\mathrm{EG}^\pm$ algorithm that relies on the variance-based automatic tuning of [11]. The resulting algorithm, called the adaptive $\mathrm{EG}^\pm$ algorithm, can be applied to general convex and differentiable loss functions. When applied to the square loss, it yields an algorithm of the same computational complexity as the $\mathrm{EG}^\pm$ algorithm that also achieves a nearly optimal regret but without needing to know $X$, $Y$, and $T$ beforehand.

$^1$ Actually our results hold whether $(x_t, y_t)_{t \geq 1}$ is generated by an oblivious environment or a non-oblivious opponent since we consider deterministic forecasters.

$^2$ In high dimensions (i.e., when $d \geq \omega T$, for some absolute constant $\omega > 0$), we do not observe this transition (cf. Figure 1).

$^3$ For example, $(x_t, y_t)_{1 \leq t \leq T}$ may be i.i.d., or $x_t$ can be deterministic and $y_t = f(x_t) + \varepsilon_t$ for an unknown function $f$ and an i.i.d. sequence $(\varepsilon_t)_{1 \leq t \leq T}$ of Gaussian noise.

Our third contribution (Section 3.3) is a generic technique called loss Lipschitzification. It transforms the loss functions $u \mapsto (y_t - u \cdot x_t)^2$ (or $u \mapsto |y_t - u \cdot x_t|^\alpha$ if the predictions are scored with the $\alpha$-loss for a real number $\alpha \geq 2$) into Lipschitz continuous functions. We illustrate this technique by applying the generic adaptive $\mathrm{EG}^\pm$ algorithm to the modified loss functions. When the predictions are scored with the square loss, this yields an algorithm (the LEG algorithm) whose main regret term slightly improves on that derived for the adaptive $\mathrm{EG}^\pm$ algorithm without Lipschitzification. The benefits of this technique are clearer for loss functions with higher curvature: if $\alpha > 2$, then the resulting regret bound roughly grows as $U$ instead of a naive $U^{\alpha/2}$.

Finally, in Section 4, we provide a simple way to achieve minimax regret uniformly over all $\ell^1$-balls $B_1(U)$ for $U > 0$. This method aggregates instances of an algorithm that requires prior knowledge of $U$. For the sake of simplicity, we assume that $X$, $Y$, and $T$ are known, but explain in the discussions how to extend the method to a fully adaptive algorithm that requires the knowledge neither of $U$, $X$, $Y$, nor $T$.

This paper is organized as follows. In Section 2, we establish our refined upper and lower bounds in terms of the intrinsic quantity $\kappa$. In Section 3, we present an efficient and adaptive algorithm (the adaptive $\mathrm{EG}^\pm$ algorithm with or without loss Lipschitzification) that achieves the optimal regret on $B_1(U)$ when $U$ is known. In Section 4, we use an aggregating strategy to achieve an optimal regret uniformly over all $\ell^1$-balls $B_1(U)$, for $U > 0$, when $X$, $Y$, and $T$ are known. Finally, in Section 5, we discuss as an extension a fully automatic algorithm that requires no prior knowledge of $U$, $X$, $Y$, or $T$. Some proofs and additional tools are postponed to the appendix.



2. Optimal rates 

In this section, we first present a refined upper bound on the minimax regret on $B_1(U)$ for an arbitrary $U > 0$. In Corollary 1, we express this upper bound in terms of an intrinsic quantity $\kappa = \sqrt{T} U X/(2dY)$. The optimality of the latter bound is shown in Section 2.2.

We consider the following definition to avoid any ambiguity. We call online forecaster any sequence $F = (f_t)_{t \geq 1}$ of functions such that $f_t : \mathbb{R}^d \times (\mathbb{R}^d \times \mathbb{R})^{t-1} \to \mathbb{R}$ maps at time $t$ the new input $x_t$ and the past data $(x_s, y_s)_{1 \leq s \leq t-1}$ to a prediction $f_t(x_t; (x_s, y_s)_{1 \leq s \leq t-1})$. Depending on the context, the latter prediction may be simply denoted by $f_t(x_t)$ or by $\hat{y}_t$.



2.1. Upper bound 

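Theorem 1 (Upper bound). Let $d, T \in \mathbb{N}^*$ and $U, X, Y > 0$. The minimax regret on $B_1(U)$ for bounded base predictions and observations satisfies

\[
\inf_F \sup_{\|x_t\|_\infty \leq X, \, |y_t| \leq Y} \left\{ \sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\}
\leq
\begin{cases}
3UXY \sqrt{2T \ln(2d)} & \text{if } U < \frac{Y}{X} \sqrt{\frac{\ln(1+2d)}{T \ln 2}} , \\
26 \, UXY \sqrt{T \ln\left(1 + \frac{2dY}{\sqrt{T} U X}\right)} & \text{if } \frac{Y}{X} \sqrt{\frac{\ln(1+2d)}{T \ln 2}} \leq U \leq \frac{2dY}{\sqrt{T} X} , \\
32 \, d Y^2 \ln\left(1 + \frac{\sqrt{T} U X}{dY}\right) + d Y^2 & \text{if } U > \frac{2dY}{\sqrt{T} X} ,
\end{cases}
\]

where the infimum is taken over all forecasters $F$ and where the supremum extends over all sequences $(x_t, y_t)_{1 \leq t \leq T} \in (\mathbb{R}^d \times \mathbb{R})^T$ such that $|y_1|, \ldots, |y_T| \leq Y$ and $\|x_1\|_\infty, \ldots, \|x_T\|_\infty \leq X$.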



Theorem 1 improves the bound of [9, Theorem 5.11] for the $\mathrm{EG}^\pm$ algorithm. First, our bound depends logarithmically (as opposed to linearly) on $U$ for $U > 2dY/(\sqrt{T} X)$. Secondly, it is smaller by a factor ranging from $1$ to $\sqrt{\ln d}$ when

\[ \frac{Y}{X} \sqrt{\frac{\ln(1+2d)}{T \ln 2}} \leq U \leq \frac{2dY}{\sqrt{T} X} . \qquad (1) \]

Hence, Theorem 1 provides a partial answer to a question$^4$ raised in [9] about the gap of $\sqrt{\ln(2d)}$ between the upper and lower bounds.

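Before proving the theorem (see below), we state the following immediate corollary. It expresses the upper bound of Theorem 1 in terms of an intrinsic quantity $\kappa = \sqrt{T} U X/(2dY)$ that relates $\sqrt{T} U X/(2Y)$ to the ambient dimension $d$.

Corollary 1 (Upper bound in terms of an intrinsic quantity). Let $d, T \in \mathbb{N}^*$ and $U, X, Y > 0$. The upper bound of Theorem 1 expressed in terms of $d$, $Y$, and the intrinsic quantity $\kappa = \sqrt{T} U X/(2dY)$ reads:

\[
\inf_F \sup_{\|x_t\|_\infty \leq X, \, |y_t| \leq Y} \left\{ \sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\}
\leq
\begin{cases}
6 \, d Y^2 \kappa \sqrt{2 \ln(2d)} & \text{if } \kappa < \frac{\sqrt{\ln(1+2d)}}{2d \sqrt{\ln 2}} , \\
52 \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} & \text{if } \frac{\sqrt{\ln(1+2d)}}{2d \sqrt{\ln 2}} \leq \kappa \leq 1 , \\
32 \, d Y^2 \left( \ln(1 + 2\kappa) + 1 \right) & \text{if } \kappa > 1 .
\end{cases}
\]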



The upper bound of Corollary 1 is shown in Figure 1. Observe that, in low dimension (Figure 1(b)), a clear transition from a regret of the order of $\sqrt{T}$ to one of $\ln T$ occurs at $\kappa = 1$. This transition is absent for high dimensions: for $d \geq \omega T$, where $\omega = (32(\ln(3) + 1))^{-1}$, the regret bound $32 \, d Y^2 (\ln(1 + 2\kappa) + 1)$ is worse than a trivial bound of $T Y^2$ when $\kappa \geq 1$.
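
The piecewise bound of Corollary 1 is easy to evaluate numerically. The sketch below is illustrative only: the function names `kappa` and `corollary1_bound` are hypothetical, and the code simply transcribes the three regimes stated above, which makes it convenient to check for which values of $d$ and $T$ the bound falls below the trivial bound $TY^2$.

```python
import math

def kappa(d, T, U, X, Y):
    """Intrinsic quantity kappa = sqrt(T) U X / (2 d Y)."""
    return math.sqrt(T) * U * X / (2 * d * Y)

def corollary1_bound(d, Y, k):
    """Upper bound of Corollary 1 as a function of d, Y and kappa = k."""
    k_min = math.sqrt(math.log(1 + 2 * d)) / (2 * d * math.sqrt(math.log(2)))
    if k < k_min:
        return 6 * d * Y**2 * k * math.sqrt(2 * math.log(2 * d))
    if k <= 1:
        return 52 * d * Y**2 * k * math.sqrt(math.log(1 + 1 / k))
    return 32 * d * Y**2 * (math.log(1 + 2 * k) + 1)

# Example: d = 10, T = 400, U = X = Y = 1 sits exactly at the transition kappa = 1.
k = kappa(10, 400, 1.0, 1.0, 1.0)
print(k, corollary1_bound(10, 1.0, k))
```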
[Figure 1, two panels: (a) High dimension $d \geq \omega T$. (b) Low dimension $d < \omega T$.]

Figure 1: The regret bound of Corollary 1 over $B_1(U)$ as a function of $\kappa = \sqrt{T} U X/(2dY)$. The constant $c$ is chosen to ensure continuity at $\kappa = 1$, and $\omega = (32(\ln(3) + 1))^{-1}$. We define: $\kappa_{\min} = \sqrt{\ln(1 + 2d)}/(2d \sqrt{\ln 2})$ and Kmax = (e(^/ - l)/2.



We now prove Theorem 1. The main part of the proof relies on a Maurey-type argument. Although this argument was used in the stochastic setting [1, 2, 3, 4], we adapt it to the deterministic setting. This is yet another technique that can be applied to both the stochastic and individual sequence settings.

$^4$ The authors of [9] asked: "For large d there is a significant gap between the upper and lower bounds. We would like to know if it possible to improve the upper bounds by eliminating the ln d factors."
Proof (of Theorem 1): First note from Lemma 5 in Appendix B that the minimax regret on $B_1(U)$ is upper bounded$^5$ by

\[ \min\left\{ 3UXY \sqrt{2T \ln(2d)} , \; 32 \, d Y^2 \ln\left(1 + \frac{\sqrt{T} U X}{dY}\right) + d Y^2 \right\} . \qquad (2) \]

Therefore, the first case $U < \frac{Y}{X} \sqrt{\frac{\ln(1+2d)}{T \ln 2}}$ and the third case $U > \frac{2dY}{\sqrt{T} X}$ are straightforward.

Therefore, we assume in the sequel that $\frac{Y}{X} \sqrt{\frac{\ln(1+2d)}{T \ln 2}} \leq U \leq \frac{2dY}{\sqrt{T} X}$.

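We use a Maurey-type argument to refine the regret bound (2). This technique was used under various forms in the stochastic setting, e.g., in [1, 2, 3, 4]. It consists of discretizing $B_1(U)$ and looking at a random point in this discretization to study its approximation properties. We also use clipping to get a regret bound growing as $U$ instead of a naive $U^2$.

More precisely, we first use the fact that to be competitive against $B_1(U)$, it is sufficient to be competitive against its finite subset

\[ \widetilde{B}_{U,m} = \left\{ \left( \frac{k_1 U}{m}, \ldots, \frac{k_d U}{m} \right) : (k_1, \ldots, k_d) \in \mathbb{Z}^d , \; \sum_{j=1}^d |k_j| \leq m \right\} \subset B_1(U) , \]

where $m = \lfloor \alpha \rfloor$ with $\alpha = \frac{\sqrt{T} U X}{Y} \sqrt{ (\ln 2) \Big/ \ln\left(1 + \frac{2dY}{\sqrt{T} U X}\right) }$.

By Lemma 7 in Appendix C and since $m > 0$ (see below), we indeed have

\[
\inf_{u \in \widetilde{B}_{U,m}} \sum_{t=1}^T (y_t - u \cdot x_t)^2
\leq \inf_{u \in B_1(U)} \sum_{t=1}^T (y_t - u \cdot x_t)^2 + \frac{T U^2 X^2}{m}
\leq \inf_{u \in B_1(U)} \sum_{t=1}^T (y_t - u \cdot x_t)^2 + \frac{2}{\sqrt{\ln 2}} \, UXY \sqrt{T \ln\left(1 + \frac{2dY}{\sqrt{T} U X}\right)} , \qquad (3)
\]

where (3) follows from $m = \lfloor \alpha \rfloor \geq \alpha/2$ since $\alpha \geq 1$ (in particular, $m > 0$ as stated above).

To see why $\alpha \geq 1$, note that it suffices to show that $x \sqrt{\ln(1 + x)} \leq 2d \sqrt{\ln 2}$ where we set $x = 2dY/(\sqrt{T} U X)$. But from the assumption $U \geq (Y/X) \sqrt{\ln(1 + 2d)/(T \ln 2)}$, we have $x \leq 2d \sqrt{\ln(2)/\ln(1 + 2d)} \triangleq y$, so that, by monotonicity, $x \sqrt{\ln(1 + x)} \leq y \sqrt{\ln(1 + y)} \leq y \sqrt{\ln(1 + 2d)} = 2d \sqrt{\ln 2}$.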

Therefore it only remains to exhibit an algorithm which is competitive against $\widetilde{B}_{U,m}$ at an aggregation price of the same order as the last term in (3). This is the case for the standard exponentially weighted average forecaster applied to the clipped predictions

\[ [u \cdot x_t]_Y \triangleq \min\{ Y, \max\{ -Y, u \cdot x_t \} \} , \qquad u \in \widetilde{B}_{U,m} , \]

and tuned with the inverse temperature parameter $\eta = 1/(8Y^2)$. More formally, this algorithm predicts at each time $t = 1, \ldots, T$ as

\[ \hat{y}_t = \sum_{u \in \widetilde{B}_{U,m}} p_t(u) \, [u \cdot x_t]_Y , \]

where $p_1(u) = 1/|\widetilde{B}_{U,m}|$ (denoting by $|\widetilde{B}_{U,m}|$ the cardinality of the set $\widetilde{B}_{U,m}$), and where the weights $p_t(u)$ are defined for all $t = 2, \ldots, T$ and $u \in \widetilde{B}_{U,m}$ by

\[ p_t(u) = \frac{ \exp\left( -\eta \sum_{s=1}^{t-1} (y_s - [u \cdot x_s]_Y)^2 \right) }{ \sum_{v \in \widetilde{B}_{U,m}} \exp\left( -\eta \sum_{s=1}^{t-1} (y_s - [v \cdot x_s]_Y)^2 \right) } . \]

$^5$ As proved in Lemma 5, the regret bound (2) is achieved either by the $\mathrm{EG}^\pm$ algorithm, the algorithm SeqSEW of [12] (we could also get a slightly worse bound with the sequential ridge regression forecaster [13, 14]), or the trivial null forecaster.
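
The forecaster used in this step of the proof can be sketched as follows. This is a minimal illustration under stated assumptions: the grid is taken to be the set of points $(k_1 U/m, \ldots, k_d U/m)$ with integer $k_j$ and $\sum_j |k_j| \leq m$, it is enumerated by brute force (only feasible for tiny $d$ and $m$, which reflects the inefficiency of this construction), and the names `maurey_grid` and `ewa_clipped` as well as the synthetic data are hypothetical.

```python
import itertools
import math
import numpy as np

def maurey_grid(d, U, m):
    """Points (k_1 U/m, ..., k_d U/m) with integer k_j and sum_j |k_j| <= m."""
    ks = [k for k in itertools.product(range(-m, m + 1), repeat=d)
          if sum(abs(kj) for kj in k) <= m]
    return np.array(ks, dtype=float) * (U / m)

def ewa_clipped(experts, xs, ys, Y):
    """Exponentially weighted average of clipped linear predictions, eta = 1/(8 Y^2)."""
    eta = 1.0 / (8.0 * Y ** 2)
    cum_loss = np.zeros(len(experts))        # cumulative clipped loss of each grid point
    preds = np.empty(len(ys))
    for t in range(len(ys)):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))   # shift for numerical stability
        p = w / w.sum()                                   # uniform weights at t = 0
        expert_preds = np.clip(experts @ xs[t], -Y, Y)    # clipped predictions [u . x_t]_Y
        preds[t] = p @ expert_preds
        cum_loss += (ys[t] - expert_preds) ** 2
    return preds, cum_loss

rng = np.random.default_rng(1)
d, U, m, T, X, Y = 3, 2.0, 4, 200, 1.0, 1.0
grid = maurey_grid(d, U, m)
xs = rng.uniform(-X, X, size=(T, d))
ys = np.clip(xs @ np.array([1.0, -0.5, 0.0]) + 0.1 * rng.standard_normal(T), -Y, Y)
preds, expert_losses = ewa_clipped(grid, xs, ys, Y)
regret = float(np.sum((ys - preds) ** 2) - expert_losses.min())
print(len(grid), regret, 8 * Y**2 * math.log(len(grid)))
```

The printed regret against the best clipped grid point can be compared with the aggregation price $8Y^2 \ln|\widetilde{B}_{U,m}|$ that appears in the next step of the proof.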



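By Lemma ini in Appendix B, the above forecaster tuned with $\eta = 1/(8Y^2)$ satisfies

\[
\begin{aligned}
\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{u \in \widetilde{B}_{U,m}} \sum_{t=1}^T (y_t - u \cdot x_t)^2
&\leq 8Y^2 \ln |\widetilde{B}_{U,m}| \\
&\leq 8Y^2 \ln\left( \left( \frac{e(2d + m)}{m} \right)^m \right) && (4) \\
&= 8Y^2 m \left( 1 + \ln(1 + 2d/m) \right) \leq 8Y^2 \alpha \left( 1 + \ln(1 + 2d/\alpha) \right) && (5) \\
&= 8Y^2 \alpha + 8Y^2 \alpha \ln\left( 1 + \frac{2dY}{\sqrt{T} U X} \sqrt{ \frac{\ln\left(1 + 2dY/(\sqrt{T} U X)\right)}{\ln 2} } \right) \\
&\leq 8Y^2 \alpha + 16 Y^2 \alpha \ln\left( 1 + \frac{2dY}{\sqrt{T} U X} \right) && (6) \\
&\leq \left( \frac{8}{\sqrt{\ln 2}} + 16 \sqrt{\ln 2} \right) UXY \sqrt{ T \ln\left( 1 + \frac{2dY}{\sqrt{T} U X} \right) } . && (7)
\end{aligned}
\]

To get (4) we used Lemma El in Appendix C. Inequality (5) follows by definition of $m \leq \alpha$ and the fact that $x \mapsto x(1 + \ln(1 + A/x))$ is nondecreasing on $\mathbb{R}_+^*$ for all $A > 0$. Inequality (6) follows from the assumption $U \leq 2dY/(\sqrt{T} X)$ and the elementary inequality $\ln\left(1 + x \sqrt{\ln(1 + x)/\ln 2}\right) \leq 2 \ln(1 + x)$ which holds for all $x \geq 1$ and was used, e.g., at the end of 0, Theorem 2-a)]. Finally, elementary manipulations combined with the assumption that $2dY/(\sqrt{T} U X) \geq 1$ lead to (7).

Putting Eqs. (3) and (7) together, the previous algorithm has a regret on $B_1(U)$ which is bounded from above by

\[ \left( \frac{10}{\sqrt{\ln 2}} + 16 \sqrt{\ln 2} \right) UXY \sqrt{ T \ln\left( 1 + \frac{2dY}{\sqrt{T} U X} \right) } , \]

which concludes the proof since $10/\sqrt{\ln 2} + 16 \sqrt{\ln 2} \leq 26$. □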

2.2. Lower bound 

Corollary 1 gives an upper bound on the regret in terms of the quantities $d$, $Y$, and $\kappa = \sqrt{T} U X/(2dY)$. We now show that for all $d \in \mathbb{N}^*$, $Y > 0$, and $\kappa \geq \sqrt{\ln(1 + 2d)}/(2d \sqrt{\ln 2})$, the upper bound cannot be improved$^6$ up to logarithmic factors.

$^6$ For $T$ sufficiently large, we may overlook the case $\kappa < \sqrt{\ln(1 + 2d)}/(2d \sqrt{\ln 2})$ or $\sqrt{T} < (Y/(UX)) \sqrt{\ln(1 + 2d)/\ln 2}$. Observe that in this case, the minimax regret is already of the order of $\ln(1 + d)$ (cf. Figure 1).

Theorem 2 (Lower bound). For all $d \in \mathbb{N}^*$, $Y > 0$, and $\kappa \geq \frac{\sqrt{\ln(1 + 2d)}}{2d \sqrt{\ln 2}}$, there exist $T \geq 1$, $U > 0$, and $X > 0$ such that $\sqrt{T} U X/(2dY) = \kappa$ and

\[
\inf_F \sup_{\|x_t\|_\infty \leq X, \, |y_t| \leq Y} \left\{ \sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\}
\geq
\begin{cases}
\dfrac{c_1}{\ln(2 + 16 d^2)} \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} & \text{if } \frac{\sqrt{\ln(1 + 2d)}}{2d \sqrt{\ln 2}} \leq \kappa \leq 1 , \\
\dfrac{c_2}{\ln(2 + 16 d^2)} \, d Y^2 & \text{if } \kappa > 1 ,
\end{cases}
\]

where $c_1, c_2 > 0$ are absolute constants. The infimum is taken over all forecasters $F$ and the supremum is taken over all sequences $(x_t, y_t)_{1 \leq t \leq T} \in (\mathbb{R}^d \times \mathbb{R})^T$ such that $|y_1|, \ldots, |y_T| \leq Y$ and $\|x_1\|_\infty, \ldots, \|x_T\|_\infty \leq X$.

The above lower bound extends those of [8], [9], which hold for small $\kappa$ of the order of $1/d$. The proof is postponed to Appendix A.1. We perform a reduction to the stochastic batch setting (via the standard online to batch conversion) and employ a version of a lower bound of Q.



3. Adaptation to unknown X, Y and T via exponential weights 

Although the proof of Theorem 1 already gives an algorithm that achieves the minimax regret, the latter takes as inputs $U$, $X$, $Y$, and $T$, and it is inefficient in high dimensions. In this section, we present a new method that achieves the minimax regret both efficiently and without prior knowledge of $X$, $Y$, and $T$ provided that $U$ is known. Adaptation to an unknown $U$ is considered in Section 4. Our method consists of modifying an underlying efficient linear regression algorithm such as the $\mathrm{EG}^\pm$ algorithm [9] or the sequential ridge regression forecaster [13, 14]. Next, we show that automatically tuned variants of the $\mathrm{EG}^\pm$ algorithm nearly achieve the minimax regret for the regime $d \geq \sqrt{T} U X/(2Y)$. A similar modification could be applied to the ridge regression forecaster (without retaining additional computational efficiency) to achieve a nearly optimal regret bound of order $d Y^2 \ln\left(1 + d \left( \frac{\sqrt{T} U X}{dY} \right)^2 \right)$ in the regime $d < \sqrt{T} U X/(2Y)$. The latter analysis is more technical and hence is omitted.

3.1. An adaptive $\mathrm{EG}^\pm$ algorithm for general convex and differentiable loss functions

The second algorithm of the proof of Theorem 1 is computationally inefficient because it aggregates approximately $d^{\sqrt{T}}$ experts. In contrast, the $\mathrm{EG}^\pm$ algorithm has a manageable computational complexity that is linear in $d$ at each time $t$. Next we introduce a version of the $\mathrm{EG}^\pm$ algorithm, called the adaptive $\mathrm{EG}^\pm$ algorithm, that does not require prior knowledge of $X$, $Y$ and $T$ (as opposed to the original $\mathrm{EG}^\pm$ algorithm of [9]). This version relies on the automatic tuning of [11]. We first present a generic version suited for general convex and differentiable loss functions. The application to the square loss and to other $\alpha$-losses will be dealt with in Sections 3.2 and 3.3.

The generic setting with arbitrary convex and differentiable loss functions corresponds to the online convex optimization setting [l^, [3] and unfolds as follows: at each time $t \geq 1$, the forecaster chooses a linear combination $\hat{u}_t \in \mathbb{R}^d$, then the environment chooses and reveals a convex and differentiable loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$, and the forecaster incurs the loss $\ell_t(\hat{u}_t)$. In online linear regression under the square loss, the loss functions are given by $\ell_t(u) = (y_t - u \cdot x_t)^2$.

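The adaptive $\mathrm{EG}^\pm$ algorithm for general convex and differentiable loss functions is defined in Figure 2. We denote by $(e_j)_{1 \leq j \leq d}$ the canonical basis of $\mathbb{R}^d$, by $\nabla \ell_t(u)$ the gradient of $\ell_t$ at $u \in \mathbb{R}^d$, and by $\nabla_j \ell_t(u)$ the $j$-th component of this gradient. The adaptive $\mathrm{EG}^\pm$ algorithm uses as a blackbox the exponentially weighted majority forecaster of [11] on $2d$ experts (namely, the vertices $\pm U e_j$ of $B_1(U)$) as in [9]. It adapts to the unknown gradient amplitudes $\|\nabla \ell_t\|_\infty$ by the particular choice of $\eta_t$ due to [11] and defined for all $t \geq 2$ by

\[ \eta_t = \min\left\{ \frac{1}{\hat{E}_{t-1}} , \; C \sqrt{ \frac{\ln(2d)}{V_{t-1}} } \right\} , \qquad (8) \]

where $C = \sqrt{2(\sqrt{2} - 1)/(e - 2)}$ and where we set, for all $t = 1, \ldots, T$, with $z_{j,s}^{\gamma} \triangleq \gamma U \nabla_j \ell_s(\hat{u}_s)$ for $j = 1, \ldots, d$, $\gamma \in \{+, -\}$ and $s = 1, \ldots, t$,

\[
\hat{E}_t = \inf_{k \in \mathbb{Z}} \left\{ 2^k : 2^k \geq \max_{1 \leq s \leq t} \; \max_{\substack{1 \leq j, k \leq d \\ \gamma, \mu \in \{+, -\}}} \left| z_{j,s}^{\gamma} - z_{k,s}^{\mu} \right| \right\} ,
\qquad
V_t = \sum_{s=1}^t \sum_{\substack{1 \leq j \leq d \\ \gamma \in \{+, -\}}} p_{j,s}^{\gamma} \left( z_{j,s}^{\gamma} - \sum_{\substack{1 \leq k \leq d \\ \mu \in \{+, -\}}} p_{k,s}^{\mu} z_{k,s}^{\mu} \right)^2 .
\]

Note that $\hat{E}_{t-1}$ approximates the range of the $z_{j,s}^{\gamma}$ up to time $t - 1$, while $V_{t-1}$ is the corresponding cumulative variance of the forecaster.

Parameter: radius $U > 0$.

Initialization: $p_1 = (p_{1,1}^+, p_{1,1}^-, \ldots, p_{d,1}^+, p_{d,1}^-) \triangleq (1/(2d), \ldots, 1/(2d)) \in \mathbb{R}^{2d}$.

At each time round $t \geq 1$,

1. Output the linear combination $\hat{u}_t \triangleq U \sum_{j=1}^d (p_{j,t}^+ - p_{j,t}^-) e_j \in B_1(U)$;

2. Receive the loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$ and update the parameter $\eta_{t+1}$ according to (8);

3. Update the weight vector $p_{t+1} = (p_{1,t+1}^+, p_{1,t+1}^-, \ldots, p_{d,t+1}^+, p_{d,t+1}^-) \in \mathcal{X}_{2d}$ defined for all $j = 1, \ldots, d$ and $\gamma \in \{+, -\}$ by$^a$

\[ p_{j,t+1}^{\gamma} \triangleq \frac{ \exp\left( -\eta_{t+1} \sum_{s=1}^t \gamma U \nabla_j \ell_s(\hat{u}_s) \right) }{ \sum_{\substack{1 \leq k \leq d \\ \mu \in \{+, -\}}} \exp\left( -\eta_{t+1} \sum_{s=1}^t \mu U \nabla_k \ell_s(\hat{u}_s) \right) } . \]

$^a$ For all $\gamma \in \{+, -\}$, by a slight abuse of notation, $\gamma U$ denotes $U$ or $-U$ if $\gamma = +$ or $\gamma = -$ respectively.

Figure 2: The adaptive $\mathrm{EG}^\pm$ algorithm for general convex and differentiable loss functions (see Proposition 1).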
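
The update rules above translate fairly directly into code. The following is a minimal sketch, not the authors' implementation: the class name `AdaptiveEG`, the storage layout of the weights (all "+" coordinates first, then all "-" coordinates), the handling of the degenerate cases where all gradients seen so far are zero, and the synthetic square-loss experiment are assumptions made for illustration. The tuning follows the rule $\eta_{t+1} = \min\{1/\hat{E}_t, C\sqrt{\ln(2d)/V_t}\}$ with $\hat{E}_t$ the smallest power of two above the largest per-round range of the linearized losses.

```python
import numpy as np

C = np.sqrt(2.0 * (np.sqrt(2.0) - 1.0) / (np.e - 2.0))

class AdaptiveEG:
    """Sketch of the adaptive EG+- weights; layout (p^+_1..p^+_d, p^-_1..p^-_d)."""

    def __init__(self, d, U):
        self.d, self.U = d, U
        self.p = np.full(2 * d, 1.0 / (2 * d))   # uniform initial weights
        self.cum_z = np.zeros(2 * d)             # cumulative linearized losses
        self.max_range = 0.0                     # largest per-round range of z
        self.V = 0.0                             # cumulative variance V_t

    def current_u(self):
        """Linear combination U * sum_j (p^+_j - p^-_j) e_j, a point of B_1(U)."""
        return self.U * (self.p[:self.d] - self.p[self.d:])

    def update(self, grad):
        """Feed the gradient of the loss at current_u() and recompute the weights."""
        z = np.concatenate([self.U * grad, -self.U * grad])
        self.V += float(self.p @ (z - self.p @ z) ** 2)
        self.max_range = max(self.max_range, float(z.max() - z.min()))
        self.cum_z += z
        if self.max_range == 0.0:
            return                                # only zero gradients so far
        E_hat = 2.0 ** np.ceil(np.log2(self.max_range))
        eta = 1.0 / E_hat
        if self.V > 0.0:
            eta = min(eta, C * np.sqrt(np.log(2 * self.d) / self.V))
        w = np.exp(-eta * (self.cum_z - self.cum_z.min()))
        self.p = w / w.sum()

# Hypothetical experiment: square loss with a sparse target in B_1(U).
rng = np.random.default_rng(2)
d, U, T = 5, 1.0, 300
u_star = np.zeros(d)
u_star[1] = U
alg = AdaptiveEG(d, U)
loss_alg = loss_star = sq_grad_sum = max_grad = max_l1 = 0.0
for _ in range(T):
    x = rng.uniform(-1.0, 1.0, size=d)
    y = float(u_star @ x)
    u = alg.current_u()
    max_l1 = max(max_l1, float(np.abs(u).sum()))
    grad = -2.0 * (y - u @ x) * x
    loss_alg += (y - u @ x) ** 2
    loss_star += (y - u_star @ x) ** 2
    g_inf = float(np.abs(grad).max())
    sq_grad_sum += g_inf ** 2
    max_grad = max(max_grad, g_inf)
    alg.update(grad)

regret = loss_alg - loss_star
bound = 4 * U * np.sqrt(sq_grad_sum * np.log(2 * d)) + U * (8 * np.log(2 * d) + 12) * max_grad
print(regret, bound)
```

Each round costs a number of operations linear in $d$, in line with the complexity claim made for the $\mathrm{EG}^\pm$ family.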

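Proposition 1 (The adaptive $\mathrm{EG}^\pm$ algorithm for general convex and differentiable loss functions).

Let $U > 0$. Then, the adaptive $\mathrm{EG}^\pm$ algorithm on $B_1(U)$ defined in Figure 2 satisfies, for all $T \geq 1$ and all sequences of convex and differentiable$^7$ loss functions $\ell_1, \ldots, \ell_T : \mathbb{R}^d \to \mathbb{R}$,

\[
\sum_{t=1}^T \ell_t(\hat{u}_t) - \min_{\|u\|_1 \leq U} \sum_{t=1}^T \ell_t(u)
\leq 4U \sqrt{ \left( \sum_{t=1}^T \|\nabla \ell_t(\hat{u}_t)\|_\infty^2 \right) \ln(2d) } + U \left( 8 \ln(2d) + 12 \right) \max_{1 \leq t \leq T} \|\nabla \ell_t(\hat{u}_t)\|_\infty .
\]

In particular, the regret is bounded by $4U \left( \max_{1 \leq t \leq T} \|\nabla \ell_t(\hat{u}_t)\|_\infty \right) \left( \sqrt{T \ln(2d)} + 2 \ln(2d) + 3 \right)$.

$^7$ Gradients can be replaced with subgradients if the loss functions $\ell_t : \mathbb{R}^d \to \mathbb{R}$ are convex but not differentiable.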
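Proof: The proof follows straightforwardly from a linearization argument and from a regret bound of [11] applied to appropriately chosen loss vectors. Indeed, first note that by convexity and differentiability of $\ell_t : \mathbb{R}^d \to \mathbb{R}$ for all $t = 1, \ldots, T$, we get that

\[
\begin{aligned}
\sum_{t=1}^T \ell_t(\hat{u}_t) - \min_{\|u\|_1 \leq U} \sum_{t=1}^T \ell_t(u)
&= \max_{\|u\|_1 \leq U} \sum_{t=1}^T \left( \ell_t(\hat{u}_t) - \ell_t(u) \right)
\leq \max_{\|u\|_1 \leq U} \sum_{t=1}^T \nabla \ell_t(\hat{u}_t) \cdot (\hat{u}_t - u) \\
&= \max_{\substack{1 \leq j \leq d \\ \gamma \in \{+, -\}}} \sum_{t=1}^T \nabla \ell_t(\hat{u}_t) \cdot (\hat{u}_t - \gamma U e_j) && (9) \\
&= \sum_{t=1}^T \sum_{\substack{1 \leq j \leq d \\ \gamma \in \{+, -\}}} p_{j,t}^{\gamma} \, \gamma U \nabla_j \ell_t(\hat{u}_t) - \min_{\substack{1 \leq j \leq d \\ \gamma \in \{+, -\}}} \sum_{t=1}^T \gamma U \nabla_j \ell_t(\hat{u}_t) , && (10)
\end{aligned}
\]

where (9) follows by linearity of $u \mapsto \sum_{t=1}^T \nabla \ell_t(\hat{u}_t) \cdot (\hat{u}_t - u)$ on the polytope $B_1(U)$, and where (10) follows from the particular choice of $\hat{u}_t$ in Figure 2.

To conclude the proof, note that our choices of the weight vectors $p_t \in \mathcal{X}_{2d}$ in Figure 2 and of the time-varying parameter $\eta_t$ in (8) correspond to the exponentially weighted average forecaster of [11, Section 4.2] when it is applied to the loss vectors $\left( U \nabla_j \ell_t(\hat{u}_t), -U \nabla_j \ell_t(\hat{u}_t) \right)_{1 \leq j \leq d} \in \mathbb{R}^{2d}$, $t = 1, \ldots, T$. Since at time $t$ the coordinates of the last loss vector lie in an interval of length $E_t \leq 2U \|\nabla \ell_t(\hat{u}_t)\|_\infty$, we get from [11, Corollary 1] that

\[
\sum_{t=1}^T \sum_{\substack{1 \leq j \leq d \\ \gamma \in \{+, -\}}} p_{j,t}^{\gamma} \, \gamma U \nabla_j \ell_t(\hat{u}_t) - \min_{\substack{1 \leq j \leq d \\ \gamma \in \{+, -\}}} \sum_{t=1}^T \gamma U \nabla_j \ell_t(\hat{u}_t)
\leq 4U \sqrt{ \left( \sum_{t=1}^T \|\nabla \ell_t(\hat{u}_t)\|_\infty^2 \right) \ln(2d) } + U \left( 8 \ln(2d) + 12 \right) \max_{1 \leq t \leq T} \|\nabla \ell_t(\hat{u}_t)\|_\infty .
\]

Substituting the last upper bound in (10) concludes the proof. □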

3.2. Application to the square loss 

In the particular case of the square loss $\ell_t(u) = (y_t - u \cdot x_t)^2$, the gradients are given by $\nabla \ell_t(u) = -2(y_t - u \cdot x_t) x_t$ for all $u \in \mathbb{R}^d$. Applying Proposition 1, we get the following regret bound for the adaptive $\mathrm{EG}^\pm$ algorithm.
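
As a quick sanity check of the gradient formula, the sketch below compares it with central finite differences and verifies the amplitude bound $\|\nabla \ell_t(u)\|_\infty \leq 2(Y + UX)X$, which follows from Hölder's inequality when $\|u\|_1 \leq U$, $\|x_t\|_\infty \leq X$ and $|y_t| \leq Y$. The function names and the random test point are hypothetical.

```python
import numpy as np

def square_loss(u, x, y):
    """Square loss of the linear prediction u . x against the observation y."""
    return (y - u @ x) ** 2

def square_loss_grad(u, x, y):
    """Gradient of u -> (y - u . x)^2, i.e. -2 (y - u . x) x."""
    return -2.0 * (y - u @ x) * x

rng = np.random.default_rng(3)
d, U, X, Y = 4, 2.0, 1.5, 3.0
u = rng.uniform(-1.0, 1.0, size=d)
u *= U / np.abs(u).sum()                 # rescale so that ||u||_1 = U
x = rng.uniform(-X, X, size=d)           # ||x||_inf <= X
y = rng.uniform(-Y, Y)                   # |y| <= Y

grad = square_loss_grad(u, x, y)
eps = 1e-6
fd = np.array([(square_loss(u + eps * e, x, y) - square_loss(u - eps * e, x, y)) / (2 * eps)
               for e in np.eye(d)])      # central finite differences, coordinate by coordinate
print(np.abs(grad - fd).max(), np.abs(grad).max(), 2 * (Y + U * X) * X)
```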

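Corollary 2 (The adaptive $\mathrm{EG}^\pm$ algorithm under the square loss).

Let $U > 0$. Consider the online linear regression setting defined in the introduction. Then, the adaptive $\mathrm{EG}^\pm$ algorithm (see Figure 2) tuned with $U$ and applied to the loss functions $\ell_t : u \mapsto (y_t - u \cdot x_t)^2$ satisfies, for all individual sequences $(x_1, y_1), \ldots, (x_T, y_T) \in \mathbb{R}^d \times \mathbb{R}$,

\[
\begin{aligned}
\sum_{t=1}^T (y_t - \hat{u}_t \cdot x_t)^2 - \min_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2
&\leq 8UX \sqrt{ \left( \min_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right) \ln(2d) } + \left( 137 \ln(2d) + 24 \right) \left( UXY + U^2 X^2 \right) \\
&\leq 8UXY \sqrt{T \ln(2d)} + \left( 137 \ln(2d) + 24 \right) \left( UXY + U^2 X^2 \right) ,
\end{aligned}
\]

where the quantities $X = \max_{1 \leq t \leq T} \|x_t\|_\infty$ and $Y = \max_{1 \leq t \leq T} |y_t|$ are unknown to the forecaster.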
Using the terminology of {VA the first bound of Corollary 2 is an improvement for small losses: it yields a small regret when the optimal cumulative loss $\min_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2$ is small. As for the second regret bound, it indicates that the adaptive $\mathrm{EG}^\pm$ algorithm achieves approximately the regret bound of Theorem 1 in the regime $\kappa \leq 1$, i.e., $d \geq \sqrt{T} U X/(2Y)$. In this regime, our algorithm thus has a manageable computational complexity (linear in $d$ at each time $t$) and it is adaptive in $X$, $Y$, and $T$.

In particular, the above regret bound is similar$^8$ to that of the original $\mathrm{EG}^\pm$ algorithm [9, Theorem 5.11], but it is obtained without prior knowledge of $X$, $Y$, and $T$. Note also that this bound is similar to that of the self-confident $p$-norm algorithm of [10] with $p = 2 \ln d$ (see Section 1.2). The fact that we were able to get similar adaptivity and efficiency properties via exponential weighting corroborates the similarity that was already observed in a non-adaptive context between the original $\mathrm{EG}^\pm$ algorithm and the $p$-norm algorithm (in the limit $p \to +\infty$ with an appropriate initial weight vector, or for $p$ of the order of $\ln d$ with a zero initial weight vector, cf. [HI).

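Proof (of Corollary 2): We apply Proposition 1 with the square loss $\ell_t(u) = (y_t - u \cdot x_t)^2$. It yields

\[
\sum_{t=1}^T \ell_t(\hat{u}_t) - \min_{\|u\|_1 \leq U} \sum_{t=1}^T \ell_t(u)
\leq 4U \sqrt{ \left( \sum_{t=1}^T \|\nabla \ell_t(\hat{u}_t)\|_\infty^2 \right) \ln(2d) } + U \left( 8 \ln(2d) + 12 \right) \max_{1 \leq t \leq T} \|\nabla \ell_t(\hat{u}_t)\|_\infty . \qquad (11)
\]

Using the equality $\nabla \ell_t(u) = -2(y_t - u \cdot x_t) x_t$ for all $u \in \mathbb{R}^d$, we get that, on the one hand, by the upper bound $\|x_t\|_\infty \leq X$,

\[ \|\nabla \ell_t(\hat{u}_t)\|_\infty^2 \leq 4 X^2 \ell_t(\hat{u}_t) , \qquad (12) \]

and, on the other hand, $\max_{1 \leq t \leq T} \|\nabla \ell_t(\hat{u}_t)\|_\infty \leq 2(Y + UX)X$ (indeed, by Hölder's inequality, $|\hat{u}_t \cdot x_t| \leq \|\hat{u}_t\|_1 \|x_t\|_\infty \leq UX$). Substituting the last two inequalities in (11), setting $\hat{L}_T \triangleq \sum_{t=1}^T \ell_t(\hat{u}_t)$ as well as $L_T^* \triangleq \min_{\|u\|_1 \leq U} \sum_{t=1}^T \ell_t(u)$, we get that

\[ \hat{L}_T \leq L_T^* + 8UX \sqrt{\hat{L}_T \ln(2d)} + \underbrace{ \left( 16 \ln(2d) + 24 \right) \left( UXY + U^2 X^2 \right) }_{\triangleq \, C} . \]

Solving for $\hat{L}_T$ via Lemma U in Appendix B, we get that

\[
\begin{aligned}
\hat{L}_T &\leq L_T^* + C + \left( 8UX \sqrt{\ln(2d)} \right) \sqrt{L_T^* + C} + \left( 8UX \sqrt{\ln(2d)} \right)^2 \\
&\leq L_T^* + 8UX \sqrt{L_T^* \ln(2d)} + 8UX \sqrt{C \ln(2d)} + 64 U^2 X^2 \ln(2d) + C .
\end{aligned}
\]

Using that

\[
\begin{aligned}
UX \sqrt{C \ln(2d)} &= UX \ln(2d) \sqrt{ \left( 16 + 24/\ln(2d) \right) \left( UXY + U^2 X^2 \right) } \\
&\leq \sqrt{U^2 X^2 + UXY} \, \ln(2d) \sqrt{ \left( 16 + 24/\ln(2) \right) \left( UXY + U^2 X^2 \right) } \\
&= \sqrt{16 + 24/\ln(2)} \, \left( UXY + U^2 X^2 \right) \ln(2d)
\end{aligned}
\]

and performing some simple upper bounds concludes the proof of the first regret bound. The second one follows immediately by noting that $\min_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \leq \sum_{t=1}^T y_t^2 \leq T Y^2$ (since $0 \in B_1(U)$). □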
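
One plausible reading of the "simple upper bounds" is the following arithmetic: bounding $U^2X^2 \leq UXY + U^2X^2$ in the term $64 U^2 X^2 \ln(2d)$ and collecting the three contributions in $\ln(2d)$ gives the coefficient $8\sqrt{16 + 24/\ln 2} + 64 + 16$, while the constant $24$ comes from $C$. The short check below (an illustration only; the variable name `coefficient` is hypothetical) confirms that this coefficient is just below $137$.

```python
import math

# Coefficient of ln(2d) * (UXY + U^2 X^2) after collecting the three contributions:
#   8 * UX * sqrt(C ln(2d))  ->  8 * sqrt(16 + 24 / ln 2)
#   64 U^2 X^2 ln(2d)        ->  64   (using U^2 X^2 <= UXY + U^2 X^2)
#   C                        ->  16   (plus the additive constant 24)
coefficient = 8 * math.sqrt(16 + 24 / math.log(2)) + 64 + 16
print(round(coefficient, 3))
```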



$^8$ By Theorem 5.11 of [9], the original $\mathrm{EG}^\pm$ algorithm satisfies the regret bound $2UX \sqrt{2B \ln(2d)} + 2U^2 X^2 \ln(2d)$, where $B$ is an upper bound on $\min_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2$ (in particular, $B \leq T Y^2$). Note that our main regret term is larger by a multiplicative factor of $2\sqrt{2}$. However, contrary to [9], our algorithm does not require the prior knowledge of $X$ and $B$ or, alternatively, $X$, $Y$, and $T$.



3.3. A refinement via Lipschitzification of the loss function 

In Corollary 2 we used the adaptive $\mathrm{EG}^\pm$ algorithm in conjunction with the square loss functions $\ell_t : u \mapsto (y_t - u \cdot x_t)^2$. In this section we use yet another instance of the adaptive $\mathrm{EG}^\pm$ algorithm applied to a modification $\tilde{\ell}_t : \mathbb{R}^d \to \mathbb{R}$ of the square loss (or the $\alpha$-loss, see below) which is Lipschitz continuous with respect to $\|\cdot\|_1$. This leads to slightly refined regret bounds; see Theorem 3 below and Corollaries 3 and 4 thereafter.

We first present the Lipschitzification technique; its use with the adaptive $\mathrm{EG}^\pm$ algorithm is to be addressed in a few paragraphs. Since our analysis is generic enough to handle both the square loss and other loss functions with higher curvature, we consider below a slightly more general setting than online linear regression stricto sensu. Namely, we fix a real number $\alpha \geq 2$ and assume that the predictions $\hat{y}_t$ of the forecaster and the base linear predictions $u \cdot x_t$ are scored with the $\alpha$-loss, i.e., with the loss functions $x \mapsto |y_t - x|^\alpha$ for all $t \geq 1$. The particular case of the square loss ($\alpha = 2$) is considered in Corollary 3 below, while loss functions with higher curvature ($\alpha > 2$) are addressed in Corollary 4.

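The Lipschitzification proceeds as follows. At each time $t \geq 1$, we set

\[ B_t \triangleq \left( 2^{\left\lceil \log_2 \left( \max_{1 \leq s \leq t-1} |y_s|^\alpha \right) \right\rceil} \right)^{1/\alpha} , \]

where $\lceil x \rceil \triangleq \min\{ k \in \mathbb{Z} : k \geq x \}$ for all $x \in \mathbb{R}$. Note that $\max_{1 \leq s \leq t-1} |y_s| \leq B_t \leq 2^{1/\alpha} \max_{1 \leq s \leq t-1} |y_s|$. The modified (or Lipschitzified) loss function $\tilde{\ell}_t : \mathbb{R}^d \to \mathbb{R}$ is constructed as follows:

• if $|y_t| > B_t$, then

\[ \tilde{\ell}_t(u) \triangleq 0 \quad \text{for all } u \in \mathbb{R}^d ; \]

• if $|y_t| \leq B_t$, then $\tilde{\ell}_t$ is the convex function that coincides with the loss function $u \mapsto |y_t - u \cdot x_t|^\alpha$ when $|u \cdot x_t| \leq B_t$ and is linear elsewhere. An example of such function is shown in Figure 3 in the case where $\alpha = 2$. It can be formally defined as

\[
\tilde{\ell}_t(u) \triangleq
\begin{cases}
|y_t - u \cdot x_t|^\alpha & \text{if } |u \cdot x_t| \leq B_t , \\
|y_t - B_t|^\alpha + \alpha |y_t - B_t|^{\alpha - 1} (u \cdot x_t - B_t) & \text{if } u \cdot x_t > B_t , \\
|y_t + B_t|^\alpha - \alpha |y_t + B_t|^{\alpha - 1} (u \cdot x_t + B_t) & \text{if } u \cdot x_t < -B_t .
\end{cases}
\]

Observe that in both cases $|y_t| > B_t$ and $|y_t| \leq B_t$, the function $\tilde{\ell}_t$ is continuously differentiable. By construction it is also Lipschitz continuous with respect to $\|\cdot\|_1$ with an easy-to-control Lipschitz constant (see Appendix A). Another key property that we can glean from Figure 3 is that, when $|y_t| \leq B_t$, the modified loss function $\tilde{\ell}_t : \mathbb{R}^d \to \mathbb{R}$ lies in between the $\alpha$-loss function $u \mapsto |y_t - u \cdot x_t|^\alpha$ and its clipped version:

\[ \forall u \in \mathbb{R}^d , \qquad \left| y_t - [u \cdot x_t]_{B_t} \right|^\alpha \leq \tilde{\ell}_t(u) \leq |y_t - u \cdot x_t|^\alpha , \qquad (13) \]

where the clipping operator $[\cdot]_B$ is defined by $[x]_B \triangleq \min\{ B, \max\{ -B, x \} \}$ for all $x \in \mathbb{R}$ and all $B > 0$.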
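
The construction can be illustrated with a scalar sketch in which the loss is viewed as a function of the base prediction $v = u \cdot x_t$. The function names (`threshold`, `clip`, `lipschitzified_loss`) and the test values are hypothetical; the threshold follows the rule $B = (2^{\lceil \log_2(\max_s |y_s|^\alpha) \rceil})^{1/\alpha}$, with the convention $B = 0$ when no nonzero observation has been seen yet (an assumption made here for the degenerate case).

```python
import math

def threshold(past_abs_max, alpha):
    """Threshold B computed from the largest |y_s| observed so far."""
    if past_abs_max == 0.0:
        return 0.0                       # convention assumed for the degenerate case
    return (2.0 ** math.ceil(math.log2(past_abs_max ** alpha))) ** (1.0 / alpha)

def clip(v, B):
    """Clipping operator [v]_B = min(B, max(-B, v))."""
    return min(B, max(-B, v))

def lipschitzified_loss(v, y, B, alpha=2.0):
    """Modified loss as a function of the base prediction v = u . x_t."""
    if abs(y) > B:
        return 0.0
    if abs(v) <= B:
        return abs(y - v) ** alpha
    if v > B:                            # linear continuation to the right of B
        return abs(y - B) ** alpha + alpha * abs(y - B) ** (alpha - 1) * (v - B)
    # linear continuation to the left of -B
    return abs(y + B) ** alpha - alpha * abs(y + B) ** (alpha - 1) * (v + B)

alpha, y = 2.0, 1.0
B = threshold(3.0, alpha)                # largest past |y_s| equal to 3
for v in (-6.0, -4.0, 0.5, 4.0, 6.0):
    low = abs(y - clip(v, B)) ** alpha   # clipped alpha-loss
    mid = lipschitzified_loss(v, y, B, alpha)
    high = abs(y - v) ** alpha           # plain alpha-loss
    print(v, low, mid, high)
```

On these values the modified loss lies between the clipped loss and the plain loss, as in (13), and it grows only linearly once $|v|$ exceeds $B$.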

Next we illustrate the Lipschitzification technique introduced above: we apply the adaptive $\mathrm{EG}^\pm$ algorithm to the Lipschitzified loss functions $\tilde{\ell}_t$. The resulting algorithm is called the Lipschitzifying Exponentiated Gradient (LEG) algorithm and is formally defined in Figure 4. Recall that $(e_j)_{1 \leq j \leq d}$ denotes the canonical basis of $\mathbb{R}^d$ and that $\nabla_j$ denotes the $j$-th component of the gradient.

We point out that this technique is not specific to the pair of dual norms $(\|\cdot\|_1, \|\cdot\|_\infty)$ or to the $\mathrm{EG}^\pm$ algorithm; it could be used with other pairs $(\|\cdot\|_q, \|\cdot\|_p)$ (with $1/p + 1/q = 1$) and other gradient-based algorithms, such as the $p$-norm algorithm [18], [10] and its regularized variants (SMIDAS and COMID).

The next theorem bounds the cumulative $\alpha$-loss of the LEG algorithm. The proof is postponed to Appendix A.2. It follows from the bound on the adaptive $\mathrm{EG}^\pm$ algorithm for general convex and differentiable loss functions that we derived in Proposition 1 (Section 3.1). See Corollaries 3 and 4 below for regret bounds in the particular cases of the square loss ($\alpha = 2$) or of losses with higher curvature ($\alpha > 2$).
Figure 3: Example with the square loss ($\alpha = 2$) when $|y_t| \leq B_t$. The square loss $(y_t - u \cdot x_t)^2$, its clipped version $(y_t - [u \cdot x_t]_{B_t})^2$ and its Lipschitzified version $\tilde{\ell}_t(u)$ are plotted as a function of $u \cdot x_t$.



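Theorem 3. Assume that the predictions are scored with the $\alpha$-loss $x \mapsto |y_t - x|^\alpha$, where $\alpha \geq 2$ is a real number. Let $U > 0$. Then, the LEG algorithm defined in Figure 4 and tuned with $U$ satisfies, for all $T \geq 1$ and all individual sequences $(x_1, y_1), \ldots, (x_T, y_T) \in \mathbb{R}^d \times \mathbb{R}$,

\[
\sum_{t=1}^T |y_t - \hat{y}_t|^\alpha \leq \inf_{\|u\|_1 \leq U} \sum_{t=1}^T \tilde{\ell}_t(u) + a_\alpha U X Y^{\alpha/2 - 1} \sqrt{ \left( \inf_{\|u\|_1 \leq U} \sum_{t=1}^T \tilde{\ell}_t(u) \right) \ln(2d) }
+ \left( a'_\alpha \ln(2d) + 12 b_\alpha \right) U X Y^{\alpha - 1} + a''_\alpha \ln(2d) \, U^2 X^2 Y^{\alpha - 2} + a'''_\alpha Y^\alpha ,
\]

where the Lipschitzified loss functions $\tilde{\ell}_t$ are defined above, where the quantities $X = \max_{1 \leq t \leq T} \|x_t\|_\infty$ and $Y = \max_{1 \leq t \leq T} |y_t|$ are unknown to the forecaster, and where, setting $a_\alpha \triangleq 4\alpha \left(1 + 2^{1/\alpha}\right)^{\alpha/2 - 1}$ and $b_\alpha \triangleq \alpha \left(1 + 2^{1/\alpha}\right)^{\alpha - 1}$, the constants $a'_\alpha, a''_\alpha, a'''_\alpha > 0$ are defined by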



a'^ ^ (^fe„(4 + 6/ln2) + 2(l + 2-i/")"/2/7i^^ + 35^ 

a';^ ^ a„ (y^fe„(4 + 6/ln2) + a„) 
a- ^4(1 + 2-1/")" . 



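Corollary 3 (Application to the square loss). Consider the online linear regression setting under the square
loss (i.e., $\alpha = 2$). Let $U > 0$. Then, the LEG algorithm defined in Figure 4 and tuned with $U$ satisfies, for
all $T \ge 1$ and all individual sequences $(x_1, y_1), \ldots, (x_T, y_T) \in \mathbb{R}^d \times \mathbb{R}$,

\[
\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde{\ell}_t(u)
+ 8 U X \sqrt{\Bigl( \inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde{\ell}_t(u) \Bigr) \ln(2d)}
+ \bigl( 134 \ln(2d) + 58 \bigr) \bigl( UXY + U^2 X^2 \bigr) + 12 Y^2 ,
\]

where the Lipschitzified loss functions $\tilde{\ell}_t$ are defined above and where the quantities $X = \max_{1 \le t \le T} \|x_t\|_\infty$
and $Y = \max_{1 \le t \le T} |y_t|$ are unknown to the forecaster.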
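The numerical constants 8, 134, 58, and 12 of Corollary 3 can be recovered from the general constants of Theorem 3. The following is a minimal sketch, assuming the definitions $a_\alpha = 4\alpha(1+2^{1/\alpha})^{\alpha/2-1}$ and $b_\alpha = \alpha(1+2^{1/\alpha})^{\alpha-1}$ used in the proof in Appendix A.2; the function name `leg_constants` and the dictionary keys are illustrative and do not come from the paper.

```python
import math


def leg_constants(alpha: float) -> dict:
    """Evaluate the constants of Theorem 3 for the alpha-loss (alpha >= 2).

    Keys (hypothetical names): a = a_alpha, b = b_alpha,
    a1 = a'_alpha, a2 = a''_alpha, a3 = a'''_alpha.
    """
    if alpha < 2:
        raise ValueError("Theorem 3 is stated for alpha >= 2")
    a = 4 * alpha * (1 + 2 ** (1 / alpha)) ** (alpha / 2 - 1)
    b = alpha * (1 + 2 ** (1 / alpha)) ** (alpha - 1)
    root = math.sqrt(b * (4 + 6 / math.log(2)))
    tail = 2 * (1 + 2 ** (-1 / alpha)) ** (alpha / 2) / math.sqrt(math.log(2))
    return {
        "a": a,
        "b": b,
        "a1": a * (root + tail) + 8 * b,
        "a2": a * (root + a),
        "a3": 4 * (1 + 2 ** (-1 / alpha)) ** alpha,
    }


if __name__ == "__main__":
    c = leg_constants(2.0)
    # For alpha = 2 these are rounded up to 8, 134, 58 (= 12 * b) and 12.
    print(c["a"], c["a1"], 12 * c["b"], c["a2"], c["a3"])
```

For $\alpha = 2$ the coefficient $a''_2$ is smaller than $a'_2$, which is why a single factor $134 \ln(2d) + 58$ can multiply both $UXY$ and $U^2X^2$ in the corollary.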
Parameter: radius $U > 0$.

Initialization: $B_1 \triangleq 0$, $p_1 = (p^+_{1,1}, p^-_{1,1}, \ldots, p^+_{d,1}, p^-_{d,1}) \triangleq (1/(2d), \ldots, 1/(2d)) \in \mathbb{R}^{2d}$.
At each time round $t \ge 1$,

1. Compute the linear combination $\hat{u}_t \triangleq U \sum_{j=1}^d (p^+_{j,t} - p^-_{j,t}) e_j \in B_1(U)$;

2. Get $x_t \in \mathbb{R}^d$ and output the clipped prediction $\hat{y}_t \triangleq [\hat{u}_t \cdot x_t]_{B_t}$;

3. Get $y_t \in \mathbb{R}$ and define the modified loss function $\tilde{\ell}_t : \mathbb{R}^d \to \mathbb{R}$ as above;

4. Update the parameter rjt+i according to dH); 

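5. Update the weight vector $p_{t+1} = (p^+_{1,t+1}, p^-_{1,t+1}, \ldots, p^+_{d,t+1}, p^-_{d,t+1})$ in the simplex of $\mathbb{R}^{2d}$, defined for all $j = 1, \ldots, d$
and $\gamma \in \{+, -\}$ by (a)

\[
p^\gamma_{j,t+1} \triangleq \frac{\exp\Bigl( -\eta_{t+1} \sum_{s=1}^t \gamma U \nabla_j \tilde{\ell}_s(\hat{u}_s) \Bigr)}
{\sum_{1 \le k \le d, \ \mu \in \{+,-\}} \exp\Bigl( -\eta_{t+1} \sum_{s=1}^t \mu U \nabla_k \tilde{\ell}_s(\hat{u}_s) \Bigr)} ;
\]

6. Update the threshold $B_{t+1} \triangleq \bigl( 2^{\lceil \log_2 ( \max_{1 \le s \le t} |y_s|^\alpha ) \rceil} \bigr)^{1/\alpha}$.

(a) For all $\gamma \in \{+, -\}$, by a slight abuse of notation, $\gamma U$ denotes $U$ or $-U$ if $\gamma = +$ or $\gamma = -$ respectively.



Figure 4: The Lipschitzifying Exponentiated Gradient (LEG) algorithm.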
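Steps 2, 5, and 6 of Figure 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the learning rate `eta` is passed in by the caller (the paper's tuning rule for $\eta_{t+1}$ is not reproduced here), the cumulative gradients are supplied as a plain list, and the helper names `clip`, `next_threshold`, and `eg_weights` are hypothetical.

```python
import math


def clip(value: float, bound: float) -> float:
    """Clipping operator [value]_bound, i.e. projection onto [-bound, bound]."""
    return max(-bound, min(bound, value))


def next_threshold(max_abs_y: float, alpha: float) -> float:
    """Step 6: B_{t+1} = (2 ** ceil(log2(max_s |y_s| ** alpha))) ** (1 / alpha)."""
    if max_abs_y <= 0:
        # Assumption: the threshold stays at its initial value 0 until a
        # nonzero observation has been seen.
        return 0.0
    exponent = math.ceil(math.log2(max_abs_y ** alpha))
    return (2.0 ** exponent) ** (1.0 / alpha)


def eg_weights(cum_grad, U: float, eta: float):
    """Step 5 for a given eta: weights (p_1^+, p_1^-, ..., p_d^+, p_d^-).

    cum_grad[j] is the cumulative j-th partial derivative of the modified
    losses at the past iterates.
    """
    scores = []
    for g in cum_grad:
        scores.append(-eta * U * g)  # gamma = +
        scores.append(+eta * U * g)  # gamma = -
    m = max(scores)  # subtract the max for numerical stability
    expd = [math.exp(s - m) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]
```

By construction the threshold satisfies $\max_s |y_s| \le B_{t+1} \le 2^{1/\alpha} \max_s |y_s|$, which is the property used in Appendix A.2.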



Note that, in the case of the square loss, the first two terms of the bound of Corollary 3 slightly improve
on those obtained without Lipschitzification (cf. Corollary 2) since we always have

\[
\inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde{\ell}_t(u) \le \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 ,
\]

where we used the key property $\tilde{\ell}_t(u) \le (y_t - u \cdot x_t)^2$ that holds for all $u$ and all $t = 1, \ldots, T$ (by (13)
if $|y_t| \le B_t$, obvious otherwise). In particular, the LEG algorithm is adaptive in $X$, $Y$, and $T$; it achieves
approximately, and efficiently, the regret bound of Theorem 1 in the regime $\kappa \le 1$, i.e., $d \ge \sqrt{T} U X / (2Y)$.

In the case of $\alpha$-losses with a higher curvature than that of the square loss ($\alpha > 2$), the improvement is
more substantial, as indicated after the following corollary.

Corollary 4 (Application to $\alpha$-losses with $\alpha > 2$). Assume that the predictions are scored with the $\alpha$-loss
$x \mapsto |y_t - x|^\alpha$, where $\alpha > 2$. Then, the regret of the LEG algorithm on $B_1(U)$ is at most of the order of

\[
U X Y^{\alpha-1} \sqrt{T \ln(2d)} + \bigl( U X Y^{\alpha-1} + U^2 X^2 Y^{\alpha-2} \bigr) \ln(2d) + Y^\alpha ,
\]

where $X = \max_{1 \le t \le T} \|x_t\|_\infty$ and $Y = \max_{1 \le t \le T} |y_t|$ are unknown to the forecaster. The above regret bound
improves on the bound we would have obtained via a similar analysis for the adaptive EG± algorithm applied
to the original losses $\ell_t(u) = |y_t - u \cdot x_t|^\alpha$ (without Lipschitzification), namely, a bound of the order of

\[
U X (Y + UX)^{\alpha/2-1} Y^{\alpha/2} \sqrt{T \ln(2d)} + \bigl( U X (Y + UX)^{\alpha-1} + U^2 X^2 (Y + UX)^{\alpha-2} \bigr) \ln(2d) .
\]




The main difference between the two regret bounds above lies in the dependence on $U$: our main regret
term scales as $U X Y^{\alpha-1}$ while the one obtained without Lipschitzification scales as $U X (Y + UX)^{\alpha/2-1} Y^{\alpha/2}$.
The first term grows linearly in $U$ while the second one grows as $U^{\alpha/2}$, hence a clear improvement for $\alpha > 2$.

The last property stems from the fact that, thanks to Lipschitzification, the gradients $\|\nabla \tilde{\ell}_t\|_\infty$ are bounded
as $U \to +\infty$ (cf. (A.29) in Appendix A.2).
Remark 1 (Another benefit of Lipschitzification).

Another benefit of Lipschitzification is that all online convex optimization regret bounds expressed in terms
of the maximal dual norm of the gradients, i.e., $\max_{1 \le t \le T} \|\nabla \tilde{\ell}_t\|_\infty$ in our case, can be used fruitfully
with the Lipschitzified loss functions $\tilde{\ell}_t$. For instance, in the case of the square loss, using the very last bound
of Proposition 1 we get that

\[
\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \le c_1 U X Y \Bigl( \sqrt{T \ln(2d)} + 8 \ln(2d) \Bigr) + c_2 Y^2 ,
\]

where $c_1 \triangleq 8(\sqrt{2} + 1)$ and $c_2 \triangleq 4 (1 + 1/\sqrt{2})^2$. The bound is no longer an improvement for small losses
(as compared to Corollary 3), but it does not require solving any quadratic inequality. The corresponding
simple proof is postponed to the end of Appendix A.2.



4. Adaptation to unknown U 

In the previous section, the forecaster is given a radius $U > 0$ and asked to ensure a low worst-case
regret on the $\ell^1$-ball $B_1(U)$. In this section, $U$ is no longer given: the forecaster is asked to be competitive
against all balls $B_1(U)$, for $U > 0$. Namely, its worst-case regret on each $B_1(U)$ should be almost as good
as if $U$ were known beforehand. For simplicity, we assume that $X$, $Y$, and $T$ are known: we explain in
Section 5 how to simultaneously adapt to all parameters. Note that from now on, we consider again the
main framework of this paper, i.e., online linear regression under the square loss (cf. Section 1.1).



Parameters: $X, Y, \eta > 0$, $T \ge 1$, and $c > 0$ (a constant).
Initialization: $R = \lceil \log_2(2T/c) \rceil_+$, $w_1 = \bigl( 1/(R+1), \ldots, 1/(R+1) \bigr) \in \mathbb{R}^{R+1}$.

For time steps $t = 1, \ldots, T$:

1. For experts $r = 0, \ldots, R$:

   • Run the sub-algorithm $A(U_r)$ on the ball $B_1(U_r)$ and obtain the prediction $\hat{y}_t^{(r)}$.

2. Output the prediction $\hat{y}_t = \sum_{r=0}^R \frac{w_t^{(r)}}{\sum_{r'=0}^R w_t^{(r')}} \bigl[ \hat{y}_t^{(r)} \bigr]_Y$.

3. Update $w_{t+1}^{(r)} = w_t^{(r)} \exp\bigl( -\eta ( y_t - [\hat{y}_t^{(r)}]_Y )^2 \bigr)$ for $r = 0, \ldots, R$.



Figure 5: The Scaling algorithm.



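We define

\[
R \triangleq \lceil \log_2(2T/c) \rceil_+ \quad \text{and} \quad U_r \triangleq \frac{Y}{X} \, \frac{2^r}{\sqrt{T \ln(2d)}} , \quad \text{for } r = 0, \ldots, R , \tag{14}
\]

where $c > 0$ is a known absolute constant and

\[
\lceil x \rceil_+ \triangleq \min\{ k \in \mathbb{N} : k \ge x \} \quad \text{for all } x \in \mathbb{R} .
\]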



The Scaling algorithm of Figure 5 works as follows. We have access to a sub-algorithm $A(U)$ which we run
simultaneously for all $U = U_r$, $r = 0, \ldots, R$. Each instance of the sub-algorithm $A(U_r)$ performs online
linear regression on the $\ell^1$-ball $B_1(U_r)$. We employ an exponentially weighted forecaster to aggregate these
$R + 1$ sub-algorithms to perform online linear regression simultaneously on the balls $B_1(U_0), \ldots, B_1(U_R)$.
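The aggregation scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sub-algorithms are represented by a hypothetical interface with `predict(x)` and `update(x, y)` methods, `ConstantExpert` is a toy stand-in for $A(U_r)$, and $\lceil \cdot \rceil_+$ is read as the smallest nonnegative integer above its argument. The grid follows $U_r = (Y/X) \, 2^r / \sqrt{T \ln(2d)}$ and the learning rate is $\eta = 1/(8Y^2)$ as in Theorem 4.

```python
import math


def scaling_grid(X, Y, T, d, c):
    """Radii U_r = (Y / X) * 2**r / sqrt(T * ln(2d)) for r = 0, ..., R."""
    R = max(0, math.ceil(math.log2(2 * T / c)))
    base = (Y / X) / math.sqrt(T * math.log(2 * d))
    return [base * 2 ** r for r in range(R + 1)]


class ConstantExpert:
    """Toy stand-in for a sub-algorithm A(U_r): always predicts `value`."""

    def __init__(self, value):
        self.value = value

    def predict(self, x):
        return self.value

    def update(self, x, y):
        pass


class ScalingForecaster:
    """Exponentially weighted average of clipped expert predictions."""

    def __init__(self, experts, Y):
        self.experts = experts
        self.Y = Y
        self.eta = 1.0 / (8.0 * Y ** 2)
        self.log_w = [0.0] * len(experts)  # log-weights, uniform at start
        self._clipped = None

    def predict(self, x):
        self._clipped = [max(-self.Y, min(self.Y, e.predict(x)))
                         for e in self.experts]
        m = max(self.log_w)  # subtract the max for numerical stability
        w = [math.exp(lw - m) for lw in self.log_w]
        return sum(wi * p for wi, p in zip(w, self._clipped)) / sum(w)

    def update(self, x, y):
        # Must be called after predict(x) for the same round.
        for r, p in enumerate(self._clipped):
            self.log_w[r] -= self.eta * (y - p) ** 2
        for e in self.experts:
            e.update(x, y)
```

Storing log-weights rather than weights is a numerical convenience only; the normalized weights coincide with those of Figure 5.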
The following regret bound follows by exp-concavity of the square loss. 

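Theorem 4. Suppose that $X, Y > 0$ are known. Let $c, c' > 0$ be two absolute constants. Suppose that for
all $U > 0$, we have access to a sub-algorithm $A(U)$ with regret against $B_1(U)$ of at most

\[
c U X Y \sqrt{T \ln(2d)} + c' Y^2 \quad \text{for } T \ge T_0 , \tag{15}
\]

uniformly over all sequences $(x_t)$ and $(y_t)$ bounded by $X$ and $Y$. Then, for a known $T \ge T_0$, the Scaling
algorithm with $\eta = 1/(8Y^2)$ satisfies

\[
\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in \mathbb{R}^d} \Bigl\{ \sum_{t=1}^T (y_t - u \cdot x_t)^2 + 2c \|u\|_1 X Y \sqrt{T \ln(2d)} \Bigr\}
+ 8 Y^2 \ln\bigl( \lceil \log_2(2T/c) \rceil_+ + 1 \bigr) + (c + c') Y^2 . \tag{16}
\]

In particular, for every $U > 0$,

\[
\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\|u\|_1 \le U} \Bigl\{ \sum_{t=1}^T (y_t - u \cdot x_t)^2 \Bigr\} + 2c U X Y \sqrt{T \ln(2d)}
+ 8 Y^2 \ln\bigl( \lceil \log_2(2T/c) \rceil_+ + 1 \bigr) + (c + c') Y^2 .
\]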

Remark 2. By Remark 1, the LEG algorithm satisfies assumption (15) with $T_0 = \ln(2d)$, $c \triangleq 9 c_1 =
72(\sqrt{2} + 1)$, and $c' \triangleq c_2 = 4 (1 + 1/\sqrt{2})^2$.
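The value $c = 9 c_1$ can be checked in one line from the bound of Remark 1; the short derivation below is a sketch that uses only that bound and the condition $T \ge \ln(2d)$.

```latex
% For T >= ln(2d) we have ln(2d) = sqrt(ln(2d)) * sqrt(ln(2d)) <= sqrt(T ln(2d)), hence
\[
c_1 U X Y \Bigl( \sqrt{T \ln(2d)} + 8 \ln(2d) \Bigr) + c_2 Y^2
\;\le\; c_1 U X Y \Bigl( \sqrt{T \ln(2d)} + 8 \sqrt{T \ln(2d)} \Bigr) + c_2 Y^2
\;=\; 9 c_1 \, U X Y \sqrt{T \ln(2d)} + c_2 Y^2 ,
\]
% which is assumption (15) with c = 9 c_1 and c' = c_2.
```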

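Proof: Since the Scaling algorithm is an exponentially weighted average forecaster (with clipping) applied
to the $R + 1$ experts $A(U_r) = (\hat{y}_t^{(r)})_{t \ge 1}$, $r = 0, \ldots, R$, we have, by the lemma on the exponentially
weighted average forecaster with clipping in Appendix B,

\[
\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \min_{r = 0, \ldots, R} \sum_{t=1}^T \bigl( y_t - \hat{y}_t^{(r)} \bigr)^2 + 8 Y^2 \ln(R + 1)
\]
\[
\qquad \le \min_{r = 0, \ldots, R} \Bigl\{ \inf_{u \in B_1(U_r)} \Bigl\{ \sum_{t=1}^T (y_t - u \cdot x_t)^2 \Bigr\} + c U_r X Y \sqrt{T \ln(2d)} \Bigr\} + z , \tag{17}
\]

where the last inequality follows by assumption (15), and where we set

\[
z \triangleq 8 Y^2 \ln(R + 1) + c' Y^2 .
\]

Let $u^*_T \in \operatorname{argmin}_{u \in \mathbb{R}^d} \bigl\{ \sum_{t=1}^T (y_t - u \cdot x_t)^2 + 2c \|u\|_1 X Y \sqrt{T \ln(2d)} \bigr\}$. Next, we proceed by considering
three cases: $U_0 < \|u^*_T\|_1 < U_R$, $\|u^*_T\|_1 \le U_0$, and $\|u^*_T\|_1 \ge U_R$.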

Case 1: $U_0 < \|u^*_T\|_1 < U_R$. Let $r^* \triangleq \min\{ r = 0, \ldots, R : U_r \ge \|u^*_T\|_1 \}$. Note that $r^* \ge 1$ since $\|u^*_T\|_1 > U_0$.

By (17) we have

\[
\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{u \in B_1(U_{r^*})} \Bigl\{ \sum_{t=1}^T (y_t - u \cdot x_t)^2 \Bigr\} + c U_{r^*} X Y \sqrt{T \ln(2d)} + z
\]
\[
\qquad \le \sum_{t=1}^T (y_t - u^*_T \cdot x_t)^2 + 2c \|u^*_T\|_1 X Y \sqrt{T \ln(2d)} + z ,
\]

where the last inequality follows from $u^*_T \in B_1(U_{r^*})$ and from the fact that $U_{r^*} \le 2 \|u^*_T\|_1$ (since, by definition
of $r^*$, $\|u^*_T\|_1 > U_{r^*-1} = U_{r^*}/2$). Finally, we obtain (16) by definition of $u^*_T$ and $z \triangleq 8 Y^2 \ln(R+1) + c' Y^2$.

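Case 2: $\|u^*_T\|_1 \le U_0$. By (17) we have

\[
\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \Bigl\{ \sum_{t=1}^T (y_t - u^*_T \cdot x_t)^2 + c U_0 X Y \sqrt{T \ln(2d)} \Bigr\} + z , \tag{18}
\]

which yields (16) by the equality $c U_0 X Y \sqrt{T \ln(2d)} = c Y^2$ (by definition of $U_0$), by adding the nonnegative
quantity $2c \|u^*_T\|_1 X Y \sqrt{T \ln(2d)}$, and by definition of $u^*_T$ and $z$.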

Case 3: $\|u^*_T\|_1 \ge U_R$. By construction, we have $\hat{y}_t \in [-Y, Y]$, and by assumption, we have $y_t \in [-Y, Y]$, so
that

\[
\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le 4 Y^2 T \le \sum_{t=1}^T (y_t - u^*_T \cdot x_t)^2 + 2c U_R X Y \sqrt{T \ln(2d)}
\]
\[
\qquad \le \sum_{t=1}^T (y_t - u^*_T \cdot x_t)^2 + 2c \|u^*_T\|_1 X Y \sqrt{T \ln(2d)} ,
\]

where the second inequality follows by $2c U_R X Y \sqrt{T \ln(2d)} = 2c Y^2 2^R \ge 4 Y^2 T$ (since $2^R \ge 2T/c$ by definition
of $R$), and the last inequality uses the assumption $\|u^*_T\|_1 \ge U_R$. We finally get (16) by definition of $u^*_T$.

This concludes the proof of the first claim (16). The second claim follows by bounding $\|u\|_1 \le U$. □
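The case distinction of the proof relies on the geometric structure of the grid: in Case 1 the selected radius overshoots the norm of the comparator by a factor of at most 2. The sketch below illustrates this on a toy doubling grid; the helper names `r_star` and `proof_case` are hypothetical.

```python
def r_star(grid, norm_u):
    """Smallest index r with grid[r] >= norm_u, or None if norm_u exceeds the grid."""
    for r, U in enumerate(grid):
        if U >= norm_u:
            return r
    return None


def proof_case(grid, norm_u):
    """Return 1, 2 or 3 according to the three cases of the proof of Theorem 4."""
    if norm_u <= grid[0]:
        return 2
    if norm_u >= grid[-1]:
        return 3
    return 1


if __name__ == "__main__":
    grid = [0.5 * 2 ** r for r in range(6)]  # toy doubling grid U_0, ..., U_5
    for norm_u in (0.6, 3.0, 15.9):
        r = r_star(grid, norm_u)
        print(norm_u, r, grid[r] <= 2 * norm_u)
```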



5. Extension to a fully adaptive algorithm 

The Scaling algorithm of Section 4 uses prior knowledge of $Y$, $Y/X$, and $T$. In order to obtain a fully
automatic algorithm, we need to adapt efficiently to these quantities. Adaptation to $Y$ is possible via a
technique already used for the LEG algorithm, i.e., by updating the clipping range $B_t$ based on the past
observations $|y_s|$, $s \le t - 1$.

In parallel to adapting to $Y$, adaptation to $Y/X$ can be carried out as follows. We replace the exponential
sequence $\{U_0, \ldots, U_R\}$ by another exponential sequence $\{U'_0, \ldots, U'_{R'}\}$:

\[
U'_r \triangleq \frac{1}{T^k} \, \frac{2^r}{\sqrt{T \ln(2d)}} , \quad r = 0, \ldots, R' , \tag{19}
\]

where $R' \triangleq R + \lceil \log_2 T^{2k} \rceil = \lceil \log_2(2T/c) \rceil_+ + \lceil \log_2 T^{2k} \rceil$, and where $k > 1$ is a fixed constant. On the one
hand, for $T \ge T_0 \triangleq \max\{ (X/Y)^{1/k}, (Y/X)^{1/k} \}$, we have (cf. (14) and (19)),

\[
[U_0, U_R] \subset [U'_0, U'_{R'}] .
\]

Therefore, the analysis of Theorem 4 applied to the grid $\{U'_0, \ldots, U'_{R'}\}$ yields a regret bound of the order
of $U X Y \sqrt{T \ln d} + Y^2 \ln(R' + 1)$. On the other hand, clipping the predictions to $[-Y, Y]$ ensures the crude
regret bound $4 Y^2 T_0$ for small $T < T_0$. Hence, the overall regret for all $T \ge 1$ is of the order of

\[
U X Y \sqrt{T \ln d} + Y^2 \ln(k \ln T) + Y^2 \max\bigl\{ (X/Y)^{1/k}, (Y/X)^{1/k} \bigr\} .
\]
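The inclusion of the original grid in the enlarged one can be checked numerically. The sketch below is written under an explicit assumption: the enlarged grid is taken to be $U'_r = T^{-k} \, 2^r / \sqrt{T \ln(2d)}$ with $R' = R + \lceil \log_2 T^{2k} \rceil$, which is the form consistent with the threshold $T_0 = \max\{(X/Y)^{1/k}, (Y/X)^{1/k}\}$. All function names are illustrative.

```python
import math


def grid_endpoints(X, Y, T, d, c, k):
    """Return ((U_0, U_R), (U'_0, U'_{R'})) for the two exponential grids."""
    root = math.sqrt(T * math.log(2 * d))
    R = max(0, math.ceil(math.log2(2 * T / c)))
    # ceil(log2(T ** (2k))) computed as ceil(2k * log2(T)).
    R_prime = R + math.ceil(2 * k * math.log2(T))
    original = ((Y / X) / root, (Y / X) * 2 ** R / root)
    enlarged = (T ** (-k) / root, T ** (-k) * 2 ** R_prime / root)
    return original, enlarged


def covers(X, Y, T, d, c, k):
    """True if [U_0, U_R] is contained in [U'_0, U'_{R'}]."""
    (lo, hi), (lo_p, hi_p) = grid_endpoints(X, Y, T, d, c, k)
    return lo_p <= lo and hi <= hi_p


def t0(X, Y, k):
    """Threshold T_0 = max{(X/Y)**(1/k), (Y/X)**(1/k)}."""
    return max((X / Y) ** (1.0 / k), (Y / X) ** (1.0 / k))
```

With $X = 1$, $Y = 10$, and $k = 2$, the threshold is $T_0 = \sqrt{10}$: the inclusion holds for $T = 4$ and fails for $T = 2$.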

In the analysis of Theorem 4 applied to this grid, the proof remains the same by replacing $8 Y^2 \ln(R + 1)$ with $8 Y^2 \ln(R' + 1)$.

Adaptation to an unknown time horizon $T$ can be carried out via a standard doubling trick on $T$.
However, to avoid restarting the algorithm repeatedly, we can use a time-varying exponential sequence
$\{U'_{-R'(t)}(t), \ldots, U'_{R'(t)}(t)\}$ where $R'(t)$ grows at the rate of $k \ln(t)$. This gives us an algorithm that is fully
automatic in the parameters $U$, $X$, $Y$, and $T$. In this case, we can show that the regret is of the order of

\[
U X Y \sqrt{T \ln d} + Y^2 k \ln(T) + Y^2 \max\bigl\{ (\sqrt{T} X / Y)^{1/k}, (Y / (\sqrt{T} X))^{1/k} \bigr\} ,
\]

where the last two terms are negligible when $T \to +\infty$ (since $k > 1$).
Appendix A. Proofs

Appendix A.1. Proof of Theorem 2

To prove Theorem 2, we perform a reduction to the stochastic batch setting (via the standard online to
batch trick), and employ a version of the lower bound proved in [2] for convex aggregation.

We first need the following notations. Let $T \in \mathbb{N}^*$. Let $(S, \mu)$ be a probability space for which we can find
an orthonormal family $(\varphi_j)_{1 \le j \le d}$ with $d$ elements in the space of square-integrable functions on $S$, which
we denote by $L^2(S, \mu)$ thereafter. For all $u \in \mathbb{R}^d$ and $\gamma, \sigma > 0$, denote by $P^{\gamma,\sigma}_u$ the joint law of the i.i.d.
sequence $(X_t, Y_t)_{1 \le t \le T}$ such that

\[
Y_t = \gamma \varphi_u(X_t) + \sigma \varepsilon_t \in \mathbb{R} , \tag{A.1}
\]

where $\varphi_u \triangleq \sum_{j=1}^d u_j \varphi_j$, where the $X_t$ are i.i.d. points in $S$ drawn from $\mu$, and where the $\varepsilon_t$ are i.i.d. standard
Gaussian random variables such that $(X_t)_{1 \le t \le T}$ and $(\varepsilon_t)_{1 \le t \le T}$ are independent.

The next lemma is a direct adaptation of [2, Theorem 2], which we state with our notations in a slightly
more precise form (we make clear how the lower bound depends on the noise level $\sigma$ and the signal level $\gamma$).

Lemma 1 (An extension of Theorem 2 of [2]).

Let $d, T \in \mathbb{N}^*$ and $\gamma, \sigma > 0$. Let $(S, \mu)$ be a probability space for which we can find an orthonormal family
$(\varphi_j)_{1 \le j \le d}$ in $L^2(S, \mu)$, and consider the Gaussian linear model (A.1). Then there exist absolute constants
$c_4, c_5, c_6, c_7 > 0$ such that




where the infimum is taken over all estimators $\hat{f}_T : S \to \mathbb{R}$, where the supremum is taken over all
nonnegative vectors with total mass at most 1, and where $\|f\|_\mu^2 \triangleq \int_S f(x)^2 \, \mu(dx)$ for all measurable functions
$f : S \to \mathbb{R}$.



Each time the exponential sequence $(U'_r)$ expands, the weights assigned to the existing points are appropriately reassigned
to the whole new sequence.

An example of such an orthonormal family is given by $S = [-\pi, \pi]$, $\mu(dx) = dx/(2\pi)$, and $\varphi_j(x) = \sqrt{2} \sin(jx)$ for all $1 \le j \le d$ and $x \in [-\pi, \pi]$. We will
use this particular case later.

As usual, an estimator is a measurable function of the sample $(X_t, Y_t)_{1 \le t \le T}$, but the dependency on the sample is omitted.
Note that the lower bound we stated in Theorem 2 is very similar to $T$ times the above lower bound
with $\gamma \sim X$ and $\sigma \sim Y$ (recall that $\kappa = \sqrt{T} U X / (2dY)$). The main difference is that the latter holds
for unbounded observations, while we need bounded observations $y_t$, $1 \le t \le T$. A simple concentration
argument will show that these observations lie in $[-Y, Y]$ with high probability, which will yield the desired
lower bound. The proof of Theorem 2 thus consists of the following steps:

• step 1: reduction to the stochastic batch setting;

• step 2: application of Lemma 1;

• step 3: concentration argument.



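Proof (of Theorem 2): We first assume that $\sqrt{\ln(1 + 2d)} / (2d \sqrt{\ln 2}) \le \kappa \le 1$. The case when $\kappa > 1$ will
easily follow from the monotonicity of the minimax regret in $\kappa$ (see the end of the proof). We set

\[
T \triangleq 1 + \lceil (4 d \kappa)^2 \rceil , \quad U \triangleq 1 , \quad \text{and} \quad X \triangleq \frac{2 d \kappa Y}{\sqrt{T}} , \tag{A.2}
\]

so that $T \ge 2$, $\sqrt{T} U X / (2dY) = \kappa$, and $X \le Y/2$ (since $\sqrt{T} \ge 4 d \kappa$).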



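Step 1: reduction to the stochastic batch setting.
First note that by clipping to $[-Y, Y]$, we have

\[
\inf_{(\tilde{f}_t)_t} \sup \Bigl\{ \sum_{t=1}^T \bigl( y_t - \tilde{f}_t(x_t) \bigr)^2 - \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \Bigr\}
= \inf_{(\tilde{f}_t)_t : \, |\tilde{f}_t| \le Y} \sup \Bigl\{ \sum_{t=1}^T \bigl( y_t - \tilde{f}_t(x_t) \bigr)^2 - \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \Bigr\} , \tag{A.3}
\]

where the first infimum is taken over all online forecasters $(\tilde{f}_t)_t$, where the second infimum is restricted
to online forecasters $(\tilde{f}_t)_t$ which output predictions in $[-Y, Y]$, and where both suprema are taken over all
individual sequences $(x_t, y_t)_{1 \le t \le T} \in (\mathbb{R}^d \times \mathbb{R})^T$ such that $|y_1|, \ldots, |y_T| \le Y$ and $\|x_1\|_\infty, \ldots, \|x_T\|_\infty \le X$.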

Next we use the standard online to batch conversion to bound from below the right-hand side of (A.3)
by $T$ times the lower bound of Lemma 1, which we apply to the particular case where $S = [-\pi, \pi]$, where
$\mu(dx) = dx/(2\pi)$, and where $\varphi_j(x) = \sqrt{2} \sin(jx)$ for all $1 \le j \le d$ and $x \in [-\pi, \pi]$. Let

\[
\gamma \triangleq c_8 X \quad \text{and} \quad \sigma \triangleq \frac{c_9 Y}{\sqrt{\ln T}} , \tag{A.4}
\]

for some absolute constants $c_8, c_9 > 0$ to be chosen by the analysis.

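Let $(\tilde{f}_t)_{t \ge 1}$ be any online forecaster whose predictions lie in $[-Y, Y]$, and consider the estimator $\hat{f}_T$ defined
for each sample $(X_t, Y_t)_{1 \le t \le T}$ and each new input $X'$ by

\[
\hat{f}_T\bigl( X' ; (X_t, Y_t)_{1 \le t \le T} \bigr) \triangleq \frac{1}{T} \sum_{t=1}^T \tilde{f}_t\bigl( \gamma \varphi(X') ; (\gamma \varphi(X_s), Y_s)_{1 \le s \le t-1} \bigr) , \tag{A.5}
\]

Recall that an online forecaster is a sequence of functions $(\tilde{f}_t)_{t \ge 1}$, where $\tilde{f}_t : \mathbb{R}^d \times (\mathbb{R}^d \times \mathbb{R})^{t-1} \to \mathbb{R}$ maps at time $t$ the new
input $x_t$ and the past data $(x_s, y_s)_{1 \le s \le t-1}$ to a prediction $\tilde{f}_t(x_t; (x_s, y_s)_{1 \le s \le t-1})$. However, unless mentioned otherwise, we
omit the dependency in $(x_s, y_s)_{1 \le s \le t-1}$, and only write $\tilde{f}_t(x_t)$.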
where $\varphi \triangleq (\varphi_1, \ldots, \varphi_d)$, and where we explicitly wrote all the dependencies of the $\tilde{f}_t$, $t = 1, \ldots, T$.
Take $u^* \in \mathbb{R}^d_+$ achieving the supremum in Lemma 1 for the estimator $\hat{f}_T$. Note that $\|u^*\|_1 \le 1$. Besides,
consider the i.i.d. random sequence $(x_t, y_t)_{1 \le t \le T}$ in $\mathbb{R}^d \times \mathbb{R}$ defined for all $t = 1, \ldots, T$ by

\[
x_t \triangleq \bigl( \gamma \varphi_1(X_t), \ldots, \gamma \varphi_d(X_t) \bigr) \quad \text{and} \quad y_t \triangleq \gamma \varphi_{u^*}(X_t) + \sigma \varepsilon_t , \tag{A.6}
\]

where $\varphi_{u^*} \triangleq \sum_{j=1}^d u^*_j \varphi_j$ (so that $y_t = u^* \cdot x_t + \sigma \varepsilon_t$ for all $t$), where the $X_t$ are i.i.d. points in $[-\pi, \pi]$
drawn from the uniform distribution $\mu(dx) = dx/(2\pi)$, and where the $\varepsilon_t$ are i.i.d. standard Gaussian random
variables such that $(X_t)_t$ and $(\varepsilon_t)_t$ are independent. All the expectations below are thus taken with respect
to the probability distribution $P^{\gamma,\sigma}_{u^*}$.
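The random sequence defined in (A.6) is easy to simulate, which makes the boundedness properties used later in the proof concrete. The following is a minimal sketch using only the standard library; the function name `sample_sequence`, the seed argument, and the chosen parameter values are illustrative and not part of the paper.

```python
import math
import random


def sample_sequence(u_star, gamma, sigma, T, seed=0):
    """Draw (x_t, y_t), t = 1..T, with x_t = gamma * phi(X_t) and
    y_t = u_star . x_t + sigma * eps_t, where phi_j(x) = sqrt(2) * sin(j x),
    X_t is uniform on [-pi, pi] and eps_t is standard Gaussian."""
    rng = random.Random(seed)
    d = len(u_star)
    data = []
    for _ in range(T):
        X_t = rng.uniform(-math.pi, math.pi)
        x_t = [gamma * math.sqrt(2.0) * math.sin((j + 1) * X_t) for j in range(d)]
        eps_t = rng.gauss(0.0, 1.0)
        y_t = sum(u * xi for u, xi in zip(u_star, x_t)) + sigma * eps_t
        data.append((x_t, y_t))
    return data


if __name__ == "__main__":
    sample = sample_sequence([0.5, 0.0, 0.5], gamma=1.0, sigma=0.1, T=5)
    for x_t, y_t in sample:
        print(max(abs(v) for v in x_t), y_t)
```

Every input satisfies $\|x_t\|_\infty \le \gamma \sqrt{2}$, while the observations $y_t$ are unbounded because of the Gaussian noise; this is why Step 3 of the proof needs a concentration argument.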

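By standard manipulations (e.g., using the tower rule and Jensen's inequality), we get the following lower
bound. A detailed proof can be found after the proof of the present theorem.

Lemma 2 (Reduction to the batch setting).
With $(\tilde{f}_t)_{1 \le t \le T}$, $\hat{f}_T$, and $u^*$ defined above, we have

\[
\mathbb{E}\Bigl[ \sum_{t=1}^T \bigl( y_t - \tilde{f}_t(x_t) \bigr)^2 - \inf_{\|u\|_1 \le 1} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \Bigr]
\ge T \, \mathbb{E} \bigl\| \hat{f}_T - \gamma \varphi_{u^*} \bigr\|_\mu^2 .
\]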

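Step 2: application of Lemma 1.

Next we use Lemma 1 to prove that, for some absolute constants $c_9, c_{11} > 0$,

\[
T \, \mathbb{E} \bigl\| \hat{f}_T - \gamma \varphi_{u^*} \bigr\|_\mu^2 \ge \frac{c_{11} c_9^2}{\ln(2 + 16 d^2)} \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} . \tag{A.7}
\]

By Lemma 1 and by definition of $u^*$, we have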

04 rp 



iln(l 



T(lnT)"^ 



/taT W T \ cs^T{lnT)UX 



\ if c^X < _iL < g77rf 

VT'tJ VT ^ CT^ln(l+d) ■ 

if < 



CgdY 



if Crl< 



VT 



crjd 



r^ln(l+<i) 



(A.8 



where the last inequality follows from (A.4) and from $U = 1$.



The above lower bound is only meaningful if the following condition holds true: 



VT cr^ln(l + d) 



(A.9) 



But, by definition of T = 1+ [(4(iK)2] and by the assumption ■y/ln(l + 2d) / (2dVhi 2) ^ n, elementary manip- 
ulations show that (jA.9p actually holds true wheneveiF^ cg ^ cjCsCiq, where cio = h inf rj; 



If the supremum in Lemma 1 is not achieved, then we can instead take an $\varepsilon$-almost-maximizer for any $\varepsilon > 0$. Letting $\varepsilon \to 0$
in the end will conclude the proof.

By definition of $\gamma$ and $\sigma$, (A.9) is equivalent to $T \ln T \ge c_9^2 / (c_7^2 c_8^2) \, (Y/X)^2 \ln(1 + d)$. But by definition of $X$ and by the
assumption $\kappa \ge \sqrt{\ln(1 + 2d)} / (2d \sqrt{\ln 2})$, we have $Y/X \le 1/c_{10}$. Therefore, (A.9) is implied by $T \ln T \ge c_9^2 / (c_7^2 c_8^2 c_{10}^2) \ln(1 + d)$,
which in turn is implied by the condition $c_9 \le c_7 c_8 c_{10}$ (by definition of $T$).
(note that $c_{10} > 0$).

Therefore, if $c_9 \le c_7 c_8 c_{10}$, then (A.8) entails that



E 



/t - IVv 



^ min 



r(lnT) 



n-aT 



\ 



jln 1 



CgdY 



cs^T{\nT)UX 



(A.IO) 



Moreover, note that if $c_9 \le c_8 \, 2\sqrt{\ln 2}$, then $c_8 \ge c_9 / (2\sqrt{\ln 2}) \ge c_9 / (2\sqrt{\ln T})$. In this case, since $x \mapsto
x \sqrt{\ln(1 + A/x)}$ is nondecreasing on $\mathbb{R}^*_+$ for all $A > 0$, we can replace $c_8$ with $c_9 / (2\sqrt{\ln T})$ in the next
expression and get


^/wT 



UXY 



\ 



CQdY 



cs^T{\nT)UX 



CeCg 



1 



21nT 



:UXY J -In [1 



2dY \ ^ cecl 
Vtux) ^ r(lnT) 



dY^Ky^\n{l + 1/k) , 



where we used the definition of $\kappa = \sqrt{T} U X / (2dY)$.

In the sequel we will choose the absolute constants $c_8$ and $c_9$ such that

\[
c_9 \le c_7 c_8 c_{10} \quad \text{and} \quad c_9 \le c_8 \, 2\sqrt{\ln 2} . \tag{A.11}
\]

Therefore, by the above remarks, by the fact that $\ln T = \ln(1 + \lceil (4 d \kappa)^2 \rceil) \le \ln(2 + 16 d^2)$ (since $\kappa \le 1$ by
assumption), and multiplying both sides of (A.10) by $T$, we get


TE 



^ min • 



ln(2 + 16rf2) ' ln(2 + 16rf2) ^ ^ ' ^ 



CllCg 



ln(2 + 16^2) 



dY^KyJ\n{l + 1/k) 



where we set cn = min{c4/ \/ln 2, cg} , and where used the fact that x 1— >■ x-\/ln(l + l/x) is nondecreasing 
on E;^, so that its value at x = k ^ 1 is smaller than \/ln 2. This concludes the proof of (|A.7p . 
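The monotonicity fact used here, namely that $x \mapsto x \sqrt{\ln(1 + 1/x)}$ is nondecreasing on the positive reals with value $\sqrt{\ln 2}$ at $x = 1$, can be sanity-checked numerically. The snippet below is only an illustrative check on a finite grid (the name `g` is hypothetical), not a proof.

```python
import math


def g(x: float) -> float:
    """The map x -> x * sqrt(ln(1 + 1/x)) on the positive reals."""
    return x * math.sqrt(math.log(1.0 + 1.0 / x))


if __name__ == "__main__":
    xs = [i / 1000.0 for i in range(1, 1001)]  # grid on (0, 1]
    values = [g(x) for x in xs]
    nondecreasing = all(a <= b for a, b in zip(values, values[1:]))
    print(nondecreasing, g(1.0), math.sqrt(math.log(2.0)))
```

Analytically, the derivative of $x^2 \ln(1 + 1/x)$ equals $2x \ln(1 + 1/x) - x/(x+1)$, which is nonnegative because $\ln(1 + s) \ge s/(1 + s)$ for all $s \ge 0$.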



Combining Lemma [3] and (jA.7|) . we get 



E 



^{yt - ft{xt)f - inf ^{yt-u- Xt) 

— \\u\\ , <1 ^ — ' 



t=l 



C11C9 



ln(2 + 16^2) 



dY'^K^J\n{l + l/n) . 



(A.12) 



Step 3: concentration argument. 

At this stage it would be tempting to conclude by using (A.12) that since the expectation is lower bounded,
then there is at least one individual sequence with the same lower bound. However, we have no boundedness
guarantee about such an individual sequence since the random observations $y_t$ lie outside of $[-Y, Y]$ with
positive probability. Next we prove that the probability of the event

\[
A \triangleq \bigcap_{t=1}^T \bigl\{ |y_t| \le Y \bigr\}
\]

is actually close to 1, and that

\[
\mathbb{E}\Bigl[ \Bigl( \hat{L}_T - \inf_{\|u\|_1 \le 1} L_T(u) \Bigr) \mathbb{1}_A \Bigr] \ge \frac{1}{2} \, \frac{c_{11} c_9^2}{\ln(2 + 16 d^2)} \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} . \tag{A.13}
\]

(Note a missing factor of 2 between (A.12) and (A.13).) The last lower bound will then enable us to conclude
the proof of this theorem.

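Set $\hat{L}_T \triangleq \sum_{t=1}^T ( y_t - \tilde{f}_t(x_t) )^2$ and $L_T(u) \triangleq \sum_{t=1}^T (y_t - u \cdot x_t)^2$ for all $u \in \mathbb{R}^d$. Denote by $A^c$ the complement
of $A$, and by $\mathbb{1}_A$ and $\mathbb{1}_{A^c}$ the corresponding indicator functions. By the equality $\mathbb{1}_A = 1 - \mathbb{1}_{A^c}$, we have

\[
\mathbb{E}\Bigl[ \Bigl( \hat{L}_T - \inf_{\|u\|_1 \le 1} L_T(u) \Bigr) \mathbb{1}_A \Bigr]
= \mathbb{E}\Bigl[ \hat{L}_T - \inf_{\|u\|_1 \le 1} L_T(u) \Bigr] - \mathbb{E}\Bigl[ \Bigl( \hat{L}_T - \inf_{\|u\|_1 \le 1} L_T(u) \Bigr) \mathbb{1}_{A^c} \Bigr]
\]
\[
\qquad \ge \frac{c_{11} c_9^2}{\ln(2 + 16 d^2)} \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} - \mathbb{E}\bigl[ \mathbb{1}_{A^c} \hat{L}_T \bigr] , \tag{A.14}
\]

where the last inequality follows by (A.12) and by the fact that $L_T(u) \ge 0$ for all $u \in \mathbb{R}^d$. The rest of the
proof is dedicated to upper bounding the above quantity $\mathbb{E}[\mathbb{1}_{A^c} \hat{L}_T]$ by half the term on its left. This way,
we will have proved (A.13).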



First note that 



E 



^E 



E 



t=l 

T 



\yt\>Y} 



1 

4rr2p(^^)+^E[(y,-7,(a.,))'lI{|,,|>^} 



(A.15) 
(A.16) 



where (A.15) follows from the fact that the online forecaster $(\tilde{f}_t)_t$ outputs its predictions in $[-Y, Y]$. As
for (A.16), note by definition of $y_t$ that $|y_t| \le \|u^*\|_1 \gamma \|\varphi(X_t)\|_\infty + \sigma |\varepsilon_t| \le \gamma \sqrt{2} + \sigma |\varepsilon_t|$ since $\|u^*\|_1 \le 1$ and
$|\varphi_j(x)| = |\sqrt{2} \sin(jx)| \le \sqrt{2}$ for all $j = 1, \ldots, d$ and $x \in \mathbb{R}$. Therefore, by definition of $\gamma = c_8 X$, and since
$X \le Y/2$ (by definition of $X$), we get $|y_t| \le c_8 \sqrt{2} \, Y/2 + \sigma |\varepsilon_t| \le Y/2 + \sigma |\varepsilon_t|$ provided that

\[
c_8 \le \frac{1}{\sqrt{2}} , \tag{A.17}
\]

which we assume thereafter. The above remarks show that $\{ |y_t| > Y \} \subset \{ |\varepsilon_t| > Y/(2\sigma) \}$, which entails
(A.16). By the same comments and since $|\tilde{f}_t| \le Y$, we have, for all $t = 1, \ldots, T$,


E 



{y, - /t(a;0)'l{|,,|>^}] ^ E[(r/2 + a\e,\ + i^)'l{|,,|>^} 



^ 2 



3Y 



\£t\ > 



Y 
2^ 



-2a^E 



9^2 



|.|>|)+2.W/^(|e.|>^) 



.clY^ 



In 2 



(A.18) 
(A.19) 
(A.20) 



where we used the following arguments. Inequality (A.18) follows by the elementary inequality $(a + b)^2 \le
2(a^2 + b^2)$ for all $a, b \in \mathbb{R}$. To get (A.19) we used the Cauchy-Schwarz inequality and the fact that $\mathbb{E}[\varepsilon_t^4] = 3$
(since $\varepsilon_t$ is a standard Gaussian random variable). Finally, (A.20) follows by definition of $\sigma = c_9 Y / \sqrt{\ln T} \le
c_9 Y / \sqrt{\ln 2}$ and from the fact that, since $\varepsilon_t$ is a standard Gaussian random variable,

\[
P\Bigl( |\varepsilon_t| > \frac{Y}{2\sigma} \Bigr) \le 2 e^{-\frac{1}{2} \left( \frac{Y}{2\sigma} \right)^2} = 2 e^{-\frac{1}{2} \left( \frac{\sqrt{\ln T}}{2 c_9} \right)^2} = 2 T^{-1/(8 c_9^2)} .
\]

Using the fact that $P(A^c) \le \sum_{t=1}^T P(|y_t| > Y) \le \sum_{t=1}^T P(|\varepsilon_t| > Y/(2\sigma)) \le 2 T^{1 - 1/(8 c_9^2)}$ by the inequality
above and substituting (A.20) in (A.16), we get



E 



2c|V6 
In 2 



y2yl-l/(16c5 



In 2 ' 



(A.21) 



where the last inequality follows from the fact that $T^a \le 2^a$ for all $a < 0$ (since $T \ge 2$) and from a choice
of $c_9$ such that $c_9 < 1/4$ (which we assume thereafter).
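The Gaussian tail bound behind (A.20) and (A.21), $P(|\varepsilon| > t) \le 2 e^{-t^2/2}$, and its specialization to the threshold $t = Y/(2\sigma) = \sqrt{\ln T}/(2 c_9)$ can be checked numerically. The sketch below uses the exact two-sided Gaussian tail via `math.erfc`; the function names are illustrative.

```python
import math


def two_sided_tail(t: float) -> float:
    """Exact P(|eps| > t) for a standard Gaussian eps and t >= 0."""
    return math.erfc(t / math.sqrt(2.0))


def chernoff_bound(t: float) -> float:
    """Upper bound 2 * exp(-t^2 / 2) on the two-sided Gaussian tail."""
    return 2.0 * math.exp(-t * t / 2.0)


def tail_at_threshold(T: float, c9: float) -> float:
    """Value 2 * T ** (-1 / (8 c9^2)) of the bound at t = sqrt(ln T) / (2 c9)."""
    return 2.0 * T ** (-1.0 / (8.0 * c9 ** 2))


if __name__ == "__main__":
    T, c9 = 10.0, 0.2
    t = math.sqrt(math.log(T)) / (2.0 * c9)
    print(two_sided_tail(t), chernoff_bound(t), tail_at_threshold(T, c9))
```

The second and third printed values coincide, which is the identity $2 e^{-\ln T / (8 c_9^2)} = 2 T^{-1/(8 c_9^2)}$ used in the proof.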

In order to further upper bound $\mathbb{E}[\mathbb{1}_{A^c} \hat{L}_T]$, we use the following technical lemma, which is proved after the
proof of the present theorem. It relies on the following elementary argument: since $\kappa$ is large
enough and since the left-hand side of the next inequality (Lemma 3) decreases exponentially fast as $c_9 \to 0$,
then this inequality holds true for all $c_9 > 0$ small enough.



Lemma 3. There exists an absolute constant $c_{13} > 0$ such that, for all $c_9 \in (0, c_{13})$,



In 2 21n(2 + 16d2 



-dY^K^y\n{l + 1/k) . 



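We can now fix the values of the constants $c_8$ and $c_9$ and conclude the proof. Choosing $c_9$ and $c_8 =
\max\{ c_9 / (2\sqrt{\ln 2}), c_9 / (c_7 c_{10}) \}$ such that $c_8 \le 1/\sqrt{2}$ (condition (A.17)), $c_9 < 1/4$, and $c_9 < c_{13}$, then the
condition (A.11) also holds, and (A.21) combined with Lemma 3 entails that

\[
\mathbb{E}\bigl[ \mathbb{1}_{A^c} \hat{L}_T \bigr] \le \frac{1}{2} \, \frac{c_{11} c_9^2}{\ln(2 + 16 d^2)} \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} .
\]

Substituting the last inequality in (A.14), we get that

\[
\mathbb{E}\Bigl[ \Bigl( \hat{L}_T - \inf_{\|u\|_1 \le 1} L_T(u) \Bigr) \mathbb{1}_A \Bigr] \ge \frac{c_{11} c_9^2}{2 \ln(2 + 16 d^2)} \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} .
\]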



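By the above lower bound and the fact that, $P^{\gamma,\sigma}_{u^*}$-almost surely, $\|x_t\|_\infty \le \gamma \sqrt{2} \le X$ for all $t = 1, \ldots, T$
(since $\gamma = c_8 X$ and $c_8 \le 1/\sqrt{2}$), we get that

\[
\sup_{\substack{\|x_1\|_\infty, \ldots, \|x_T\|_\infty \le X \\ y_1, \ldots, y_T \in \mathbb{R}}} \Bigl\{ \Bigl( \hat{L}_T - \inf_{\|u\|_1 \le 1} L_T(u) \Bigr) \mathbb{1}_A \Bigr\}
\ge \frac{c_{11} c_9^2}{2 \ln(2 + 16 d^2)} \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} .
\]

Therefore, by definition of $A = \bigcap_{t=1}^T \{ |y_t| \le Y \}$, of $\hat{L}_T = \sum_{t=1}^T ( y_t - \tilde{f}_t(x_t) )^2$, and of $L_T(u) = \sum_{t=1}^T (y_t - u \cdot x_t)^2$,
we get that, for all online forecasters $(\tilde{f}_t)_{t \ge 1}$ whose predictions lie in $[-Y, Y]$,

\[
\sup_{\substack{\|x_1\|_\infty, \ldots, \|x_T\|_\infty \le X \\ |y_1|, \ldots, |y_T| \le Y}} \Bigl\{ \sum_{t=1}^T \bigl( y_t - \tilde{f}_t(x_t) \bigr)^2 - \inf_{\|u\|_1 \le 1} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \Bigr\}
\ge \frac{c_{11} c_9^2}{2 \ln(2 + 16 d^2)} \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} .
\]

Combining the last lower bound with (A.3) and setting $c_1 \triangleq c_{11} c_9^2 / 2$ concludes the proof under the assumption
$\sqrt{\ln(1 + 2d)} / (2d \sqrt{\ln 2}) \le \kappa \le 1$.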



Assume now that $\kappa > 1$.

The stated lower bound follows from the case when $\kappa = 1$ and by monotonicity of the minimax regret in $\kappa$
(when $d$ and $Y$ are kept constant).

More formally, by the first part of this proof (when $\kappa = 1$), we can fix $T \ge 1$, $U_1 > 0$, and $X > 0$
such that $\sqrt{T} U_1 X / (2dY) = 1$ and

\[
\inf_{(\tilde{f}_t)_t} \sup \Bigl\{ \sum_{t=1}^T \bigl( y_t - \tilde{f}_t(x_t) \bigr)^2 - \inf_{\|u\|_1 \le U_1} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \Bigr\} \ge \frac{c_1}{\ln(2 + 16 d^2)} \, d Y^2 \sqrt{\ln 2} ,
\]

where the infimum is taken over all online forecasters $(\tilde{f}_t)_{t \ge 1}$, and where the supremum is taken over all
individual sequences bounded by $X$ and $Y$.

Now take $\kappa > 1$, and set $U \triangleq \kappa U_1 > U_1$, so that $\sqrt{T} U X / (2dY) = \kappa$ (since $\sqrt{T} U_1 X / (2dY) = 1$). Moreover,
for all individual sequences bounded by $X$ and $Y$, the regret on $B_1(U)$ is at least as large as the regret on
$B_1(U_1)$ (since $U > U_1$). Combining the latter remark with the lower bound above and setting $c_2 \triangleq c_1 \sqrt{\ln 2}$
concludes the proof. □



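Proof (of Lemma 2): We use the same notations as in Step 1 of the proof of Theorem 2. Let $(X', y')$
be a random copy of $(X_1, y_1)$ independent of the sample $(X_t, y_t)_{1 \le t \le T}$, and define the random vector
$x' \triangleq ( \gamma \varphi_1(X'), \ldots, \gamma \varphi_d(X') )$. By the tower rule, we have

\[
\mathbb{E}\bigl[ ( y_t - \tilde{f}_t(x_t) )^2 \bigr] = \mathbb{E}\Bigl[ \mathbb{E}\bigl[ ( y_t - \tilde{f}_t(x_t) )^2 \bigm| (x_s, y_s)_{s \le t-1} \bigr] \Bigr] = \mathbb{E}\bigl[ ( y' - \tilde{f}_t(x') )^2 \bigr] ,
\]

where we used the fact that $\tilde{f}_t$ is built on the past data $(x_s, y_s)_{s \le t-1}$ and that $(x', y')$ and $(x_t, y_t)$ are both
independent of $(x_s, y_s)_{s \le t-1}$ and are identically distributed. Similarly $\mathbb{E}[(y_t - u \cdot x_t)^2] = \mathbb{E}[(y' - u \cdot x')^2]$.
Using the last equalities and the fact that $\mathbb{E}[\inf \{ \ldots \}] \le \inf \mathbb{E}[\{ \ldots \}]$, we get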



^ — ' \\u\\ , <1 ^ — ' 

>T(^j2^[{y'-Mx')) 

E[{y' - MX')) 



t=i 

V 2 



- inf E 



[y — u ■ X 



^ T 

= T 
= T 



inf E 



y — u ■ X 



(7^.-(X')-/t(X'))^ 

.2 



(A.22) 
(A.23) 



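Inequality (A.22) follows by definition of $\hat{f}_T = T^{-1} \sum_{t=1}^T \tilde{f}_t$ (see (A.5)) and by Jensen's inequality. As for
Inequality (A.23), it follows by expanding the square

\[
\bigl( y' - \hat{f}_T(X') \bigr)^2 = \bigl( \gamma \varphi_{u^*}(X') - \hat{f}_T(X') + y' - \gamma \varphi_{u^*}(X') \bigr)^2 ,
\]

by noting that $\mathbb{E}[ y' - \gamma \varphi_{u^*}(X') \mid X' ] = 0$ (via (A.6)) and by the fact that

\[
\inf_{\|u\|_1 \le 1} \mathbb{E}\bigl[ ( y' - u \cdot x' )^2 \bigr] = \mathbb{E}\bigl[ ( y' - \gamma \varphi_{u^*}(X') )^2 \bigr] ,
\]

where we used $\|u^*\|_1 \le 1$ (by definition of $u^*$) and $u \cdot x' = \gamma \varphi_u(X')$. This concludes the proof. □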



□ 
Proof (of Lemma 3): We use the same notations and assumptions as in the proof of Theorem 2. Since
the function $x \mapsto x \sqrt{\ln(1 + 1/x)}$ is nondecreasing on $\mathbb{R}^*_+$ and since $\kappa \ge \kappa_{\min} \triangleq \sqrt{\ln(1 + 2d)} / (2d \sqrt{\ln 2})$ by
assumption, we have

ciicl 



ln(2 + 16^2) 



ciic| 



ln(2 + 16<i2) 



dY'^Kmin\/\n{l + l/^min) 



2 0n(l + 2d) win 1 + 2dVln2/^ln(l + 2d) 



2Vln2 



hi(2 + 16d2) 



^ Ci2 , 



2Vln2 



(A.24) 
(A.25) 



where $c_{12}$ denotes the infimum of the last fraction of (A.24) over all $d \ge 1$; in particular, $c_{12} > 0$. It is now
easy to see that by choosing the absolute constant $c_{13} > 0$ small enough (where $c_{13}$ can be expressed in
terms of $c_{11}$ and $c_{12}$), we have, for all $c_9 \in (0, c_{13})$,



ln2 ~" 2Vhr2 

Multiplying both sides of the last inequality by and combining it with (IA.25P concludes the proof. □ 



8 . 22-i/(8c^) + 9 . 2i-i/(8c^) + ZZ^2' — ^ -^^Ci2 

In 2 2Vh[2 



Appendix A.2. Proofs of Theorem 3 and Remark 1

Proof (of Theorem 3): The proof follows directly from Proposition 1 and from the fact that the Lipschitzified
losses are larger than their clipped versions. Indeed, first note that, by definition of $\hat{y}_t$ and $B_{t+1} \ge |y_t|$,
we have

T T T 

i=l t=l t=l 

t:\yt\'^Bt t:\yt\>Bt 

T T 



t=i 

t.\vtKBt 



t'-Bt+\>Bt 



<^£t(Mt)+4(l + 2-i/")"y" 



(A.26) 



where the second inequality follows from the fact that: 

. if \yt\ sC Bt then \yt - [ut ■ aJtlsJ" ^t(wt) by Eq. (USD; 

• if $|y_t| > B_t$, which is equivalent to $B_{t+1} > B_t$ by definition of $B_{t+1}$, then $B_t \le B_{t+1} / 2^{1/\alpha}$, so that
$B_{t+1} + B_t \le (1 + 2^{-1/\alpha}) B_{t+1}$.

As for the third inequality above, we used the non- negativity of £t{ut) and upper bounded the geometric 
sum J2t-Bt+i>Bt -^t+i i'^ the same way as in ll|. Theorem 6], i.e., setting K = [log2 maxi^t^r 1?/*!"] : 

T K 
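To bound (A.26) further from above, we now use the fact that, by construction, the LEG algorithm is the
adaptive EG± algorithm applied to the modified loss functions $\tilde{\ell}_t$. Therefore, we get from Proposition 1
that

\[
\sum_{t=1}^T \tilde{\ell}_t(\hat{u}_t) \le \inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde{\ell}_t(u)
+ 4U \sqrt{\Bigl( \sum_{t=1}^T \bigl\| \nabla \tilde{\ell}_t(\hat{u}_t) \bigr\|_\infty^2 \Bigr) \ln(2d)}
+ U \bigl( 8 \ln(2d) + 12 \bigr) \max_{1 \le t \le T} \bigl\| \nabla \tilde{\ell}_t(\hat{u}_t) \bigr\|_\infty . \tag{A.27}
\]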



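We can now follow the same lines as in Corollary 2, except that we use the particular shape of the Lipschitzified
losses. We first derive some properties of the gradients $\nabla \tilde{\ell}_t$. Observe from the definition of $\tilde{\ell}_t$
in Section 3.2 that in both cases $|y_t| > B_t$ and $|y_t| \le B_t$, the function $\tilde{\ell}_t$ is continuously differentiable.
Moreover, if $|y_t| \le B_t$, then

\[
\forall u \in \mathbb{R}^d , \quad \nabla \tilde{\ell}_t(u) = -\alpha \, \mathrm{sgn}\bigl( y_t - [u \cdot x_t]_{B_t} \bigr) \bigl| y_t - [u \cdot x_t]_{B_t} \bigr|^{\alpha-1} x_t ,
\]

where for all $x \in \mathbb{R}$, the quantity $\mathrm{sgn}(x)$ equals 1 (resp. $-1$, 0) if $x > 0$ (resp. $x < 0$, $x = 0$).

Therefore, in both cases $|y_t| > B_t$ and $|y_t| \le B_t$, the function $\tilde{\ell}_t$ is Lipschitz continuous with respect to $\|\cdot\|_1$
with Lipschitz constant $\sup_{u \in \mathbb{R}^d} \|\nabla \tilde{\ell}_t(u)\|_\infty$ bounded as follows: for all $u \in \mathbb{R}^d$,

\[
\bigl\| \nabla \tilde{\ell}_t(u) \bigr\|_\infty \le \alpha \bigl| y_t - [u \cdot x_t]_{B_t} \bigr|^{\alpha-1} \|x_t\|_\infty \tag{A.28}
\]
\[
\qquad \le \alpha \bigl( |y_t| + B_t \bigr)^{\alpha-1} \|x_t\|_\infty \le \alpha \bigl( 1 + 2^{1/\alpha} \bigr)^{\alpha-1} \Bigl( \max_{1 \le s \le t} |y_s| \Bigr)^{\alpha-1} \|x_t\|_\infty , \tag{A.29}
\]

where we used the fact that $B_t \le 2^{1/\alpha} \max_{1 \le s \le t-1} |y_s|$.

We can draw several consequences from the inequalities above. First note that, by (A.29),

\[
\max_{1 \le t \le T} \bigl\| \nabla \tilde{\ell}_t(\hat{u}_t) \bigr\|_\infty \le \alpha \bigl( 1 + 2^{1/\alpha} \bigr)^{\alpha-1} X Y^{\alpha-1} . \tag{A.30}
\]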



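Moreover, using (A.28) and the definition of $\hat{y}_t$ in Figure 4, we can see that the gradients $\nabla \tilde{\ell}_t(\hat{u}_t)$ satisfy
$\| \nabla \tilde{\ell}_t(\hat{u}_t) \|_\infty \le \alpha |y_t - \hat{y}_t|^{\alpha-1} \|x_t\|_\infty \le \alpha X |y_t - \hat{y}_t|^{\alpha-1}$. This entails that

\[
\bigl\| \nabla \tilde{\ell}_t(\hat{u}_t) \bigr\|_\infty^2 \le \alpha^2 X^2 |y_t - \hat{y}_t|^{2\alpha-2} = \alpha^2 X^2 |y_t - \hat{y}_t|^{\alpha-2} \, |y_t - \hat{y}_t|^\alpha
\le \alpha^2 X^2 \bigl( (1 + 2^{1/\alpha}) Y \bigr)^{\alpha-2} |y_t - \hat{y}_t|^\alpha , \tag{A.31}
\]

where we used the upper bounds $|y_t| \le Y$ and $|\hat{y}_t| = |[\hat{u}_t \cdot x_t]_{B_t}| \le B_t \le 2^{1/\alpha} Y$. Substituting (A.30)
and (A.31) in (A.27) and combining the resulting bound with (A.26), we get

\[
\sum_{t=1}^T |y_t - \hat{y}_t|^\alpha \le \inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde{\ell}_t(u) + a_\alpha U X Y^{\alpha/2-1} \sqrt{\Bigl( \sum_{t=1}^T |y_t - \hat{y}_t|^\alpha \Bigr) \ln(2d)}
+ \underbrace{\bigl( 8 \ln(2d) + 12 \bigr) b_\alpha U X Y^{\alpha-1}}_{\triangleq C_1} + \underbrace{4 (1 + 2^{-1/\alpha})^\alpha Y^\alpha}_{\triangleq C_2} ,
\]

where we set $a_\alpha \triangleq 4\alpha (1 + 2^{1/\alpha})^{\alpha/2-1}$ and $b_\alpha \triangleq \alpha (1 + 2^{1/\alpha})^{\alpha-1}$.

To simplify the notations we also set $\hat{L}_T \triangleq \sum_{t=1}^T |y_t - \hat{y}_t|^\alpha$ and $\tilde{L}^*_T \triangleq \inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde{\ell}_t(u)$, so that the
previous inequality can be rewritten as

\[
\hat{L}_T \le \tilde{L}^*_T + C_1 + C_2 + a_\alpha U X Y^{\alpha/2-1} \sqrt{\hat{L}_T \ln(2d)} .
\]

Solving for $\hat{L}_T$ via Lemma 4 in Appendix B (used with $a = \tilde{L}^*_T + C_1 + C_2$ and $b = a_\alpha U X Y^{\alpha/2-1} \sqrt{\ln(2d)}$),
we get that

\[
\hat{L}_T \le \tilde{L}^*_T + a_\alpha U X Y^{\alpha/2-1} \sqrt{\tilde{L}^*_T \ln(2d)}
+ a_\alpha U X Y^{\alpha/2-1} \sqrt{(C_1 + C_2) \ln(2d)} + a_\alpha^2 U^2 X^2 Y^{\alpha-2} \ln(2d) + C_1 + C_2 . \tag{A.32}
\]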



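To conclude the proof, it just suffices to bound the term $a_\alpha U X Y^{\alpha/2-1} \sqrt{(C_1 + C_2) \ln(2d)}$ from above. First
note that

\[
\sqrt{(C_1 + C_2) \ln(2d)} \le \sqrt{C_1 \ln(2d)} + \sqrt{C_2 \ln(2d)}
\le \sqrt{C_1 \ln(2d)} + 2 (1 + 2^{-1/\alpha})^{\alpha/2} Y^{\alpha/2} \sqrt{\ln(2d)} , \tag{A.33}
\]

where the last inequality follows by definition of $C_2$ above. Now, to upper bound $\sqrt{C_1 \ln(2d)}$, we note that,
by definition of $C_1$,

\[
\sqrt{C_1 \ln(2d)} = \ln(2d) \sqrt{\bigl( 8 + 12/\ln(2d) \bigr) b_\alpha U X Y^{\alpha-1}}
\le \ln(2d) \sqrt{(8 + 12/\ln 2) \, b_\alpha} \; \frac{U X Y^{\alpha/2-1} + Y^{\alpha/2}}{2} ,
\]

where we used the elementary upper bound $\sqrt{ab} \le (a + b)/2$ with $a = U X Y^{\alpha/2-1}$ and $b = Y^{\alpha/2}$. Substituting
the last inequality in (A.33) and using $\sqrt{\ln(2d)} \le \ln(2d) / \sqrt{\ln 2}$, we finally get that

\[
a_\alpha U X Y^{\alpha/2-1} \sqrt{(C_1 + C_2) \ln(2d)}
\le a_\alpha \ln(2d) \Bigl( \sqrt{b_\alpha (4 + 6/\ln 2)} + 2 (1 + 2^{-1/\alpha})^{\alpha/2} / \sqrt{\ln 2} \Bigr) U X Y^{\alpha-1}
+ a_\alpha \ln(2d) \sqrt{b_\alpha (4 + 6/\ln 2)} \; U^2 X^2 Y^{\alpha-2} .
\]

Substituting the last inequality into (A.32) and rearranging terms concludes the proof. □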

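Proof (of Remark 1): Recall that in this remark, we focus on the square loss (i.e., $\alpha = 2$) and that we
set $c_1 \triangleq 8(\sqrt{2} + 1)$ and $c_2 \triangleq 4 (1 + 1/\sqrt{2})^2$. By the key property (13) that holds for all rounds $t$ such that
$|y_t| \le B_t$ (the other rounds accounting only for an additional total loss at most of $c_2 Y^2$, see (A.26)), we get

\[
\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2
\le \sum_{t=1}^T \tilde{\ell}_t(\hat{u}_t) - \inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde{\ell}_t(u) + c_2 Y^2
\]
\[
\qquad \le 4U \Bigl( \max_{1 \le t \le T} \bigl\| \nabla \tilde{\ell}_t \bigr\|_\infty \Bigr) \Bigl( \sqrt{T \ln(2d)} + 2 \ln(2d) + 3 \Bigr) + c_2 Y^2 \tag{A.34}
\]
\[
\qquad \le c_1 U X Y \Bigl( \sqrt{T \ln(2d)} + 8 \ln(2d) \Bigr) + c_2 Y^2 , \tag{A.35}
\]

where (A.34) follows from the remark in Proposition 1 involving the uniform bound $\max_{1 \le t \le T} \|\nabla \tilde{\ell}_t\|_\infty$,
and where (A.35) follows from $\max_{1 \le t \le T} \|\nabla \tilde{\ell}_t\|_\infty \le 2 (1 + \sqrt{2}) X Y$ (by (A.29)) and from the elementary
inequality $3 \le 6 \ln(2d)$. □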
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Appendix B. Lemmas 

The next elementary lemma is due to pll Appendix III]. It is useful to compute an upper bound on the 
cumulative loss Lt of a forecaster when Lt satisfies an inequality of the form (jB.ip . 



Lemma 4. Let $a, b \ge 0$. Assume that $x \ge 0$ satisfies the inequality

$$x \le a + b\sqrt{x} . \qquad \text{(B.1)}$$

Then,

$$x \le a + b\sqrt{a} + b^2 .$$
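
The statement of Lemma 4 can be checked numerically: the largest $x \ge 0$ satisfying $x \le a + b\sqrt{x}$ is obtained by solving the quadratic in $\sqrt{x}$, and it should never exceed $a + b\sqrt{a} + b^2$. The following is a minimal sketch under that reading; the helper names `largest_solution` and `lemma4_bound` are hypothetical.

```python
import math


def largest_solution(a: float, b: float) -> float:
    """Largest x >= 0 with x <= a + b*sqrt(x).

    Setting s = sqrt(x), the constraint reads s**2 - b*s - a <= 0,
    so s is at most the positive root (b + sqrt(b**2 + 4a)) / 2.
    """
    root = (b + math.sqrt(b * b + 4.0 * a)) / 2.0
    return root * root


def lemma4_bound(a: float, b: float) -> float:
    """Upper bound a + b*sqrt(a) + b**2 stated in Lemma 4."""
    return a + b * math.sqrt(a) + b * b


for a, b in [(0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (100.0, 0.5)]:
    # The extremal solution of (B.1) stays below the bound of Lemma 4.
    assert largest_solution(a, b) <= lemma4_bound(a, b) + 1e-12
```

The bound is tight when $a = 0$ or $b = 0$ and slightly loose otherwise, which is the price of its simple closed form.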

The next lemma is useful to prove Theorem [T]. At the end of this section, we also provide an elementary
lemma about the exponentially weighted average forecaster combined with clipping.

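Lemma 5. Let $d, T \in \mathbb{N}^*$, and $U, X, Y > 0$. The minimax regret on $B_1(U)$ for bounded base predictions
and observations satisfies

$$
\inf_F \; \sup_{\|x_t\|_\infty \le X, \; |y_t| \le Y} \left\{ \sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\}
\le \min \left\{ 3 U X Y \sqrt{2 T \ln(2d)} , \;\; 32\, d Y^2 \ln\left( 1 + \frac{\sqrt{T}\, U X}{d Y} \right) + d Y^2 \right\} ,
$$

where the infimum is taken over all forecasters $F$ and where the supremum extends over all sequences
$(x_t, y_t)_{1 \le t \le T} \in (\mathbb{R}^d \times \mathbb{R})^T$ such that $|y_1|, \ldots, |y_T| \le Y$ and $\|x_1\|_\infty, \ldots, \|x_T\|_\infty \le X$.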
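
To see which of the two terms of the minimum in Lemma 5 is active, namely $3UXY\sqrt{2T\ln(2d)}$ or $32\,dY^2\ln(1+\sqrt{T}UX/(dY)) + dY^2$, both can simply be evaluated. The sketch below does this; the function names `lemma5_terms` and `lemma5_bound` and the sample parameters are assumptions chosen for illustration.

```python
import math


def lemma5_terms(d: int, T: int, U: float, X: float, Y: float):
    """Return the two terms inside the minimum of Lemma 5."""
    log_term = math.log(2 * d)
    # Term that is typically smaller when U is small relative to d.
    first = 3 * U * X * Y * math.sqrt(2 * T * log_term)
    # Term that is typically smaller when U is large relative to d.
    second = 32 * d * Y**2 * math.log(1 + math.sqrt(T) * U * X / (d * Y)) + d * Y**2
    return first, second


def lemma5_bound(d: int, T: int, U: float, X: float, Y: float) -> float:
    """Upper bound on the minimax regret stated in Lemma 5."""
    return min(lemma5_terms(d, T, U, X, Y))
```

For instance, with $d = 10$, $T = 10^4$ and $X = Y = 1$, the first term is the smaller one for $U = 1$, whereas the second term takes over for $U = 100$.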



Proof: We treat each of the two terms in the above minimum separately.

Step 1: We prove that there exists a forecaster $F$ whose worst-case regret on $B_1(U)$ is upper bounded by
$3 U X Y \sqrt{2 T \ln(2d)}$.

First note that if $U \ge (Y/X) \sqrt{T/(2 \ln(2d))}$, then the upper bound $3 U X Y \sqrt{2 T \ln(2d)} \ge 3 T Y^2 \ge T Y^2$
is trivial (by choosing the forecaster $F$ which outputs $\hat{y}_t = 0$ at each time $t$).

We can thus assume that $U < (Y/X) \sqrt{T/(2 \ln(2d))}$. Consider the EG± algorithm as given in |,
Theorem 5.11], and denote by $\hat{u}_t \in B_1(U)$ the linear combination it outputs at each time $t \ge 1$. Then, by
the aforementioned theorem, this forecaster satisfies, uniformly over all individual sequences bounded by $X$
and $Y$, that

$$
\begin{aligned}
\sum_{t=1}^T (y_t - \hat{u}_t \cdot x_t)^2 - \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2
&\le 2 U X Y \sqrt{2 T \ln(2d)} + 2 U^2 X^2 \ln(2d) \\
&\le 2 U X Y \sqrt{2 T \ln(2d)} + 2 \left( Y \sqrt{\frac{T}{2 \ln(2d)}} \right) U X \ln(2d) \qquad \text{(B.2)} \\
&= 3 U X Y \sqrt{2 T \ln(2d)} ,
\end{aligned}
$$

where (B.2) follows from the assumption $U X < Y \sqrt{T/(2 \ln(2d))}$. This concludes the first step of this proof.
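Step 2: We prove that there exists a forecaster $F$ whose worst-case regret on $B_1(U)$ is upper bounded by
$32\, d Y^2 \ln\left( 1 + \sqrt{T}\, U X / (d Y) \right) + d Y^2$.

Such a forecaster is given by the sparsity-oriented algorithm $\mathrm{SeqSEW}^{B,\eta}_{\tau}$ of I4I (we could also get a
slightly worse bound with the sequential ridge regression forecaster of [13, 14]). Indeed, by [12, Proposition 1],
the cumulative square loss of the algorithm $\mathrm{SeqSEW}^{B,\eta}_{\tau}$ tuned with $B = Y$, $\eta = 1/(8Y^2)$ and $\tau = Y/(\sqrt{T} X)$
is upper bounded by

$$
\begin{aligned}
&\inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T (y_t - u \cdot x_t)^2 + 32 \|u\|_0 Y^2 \ln\left( 1 + \frac{\sqrt{T}\, X \|u\|_1}{\|u\|_0 Y} \right) \right\} + d Y^2 \\
&\qquad \le \inf_{\|u\|_1 \le U} \left\{ \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\} + 32\, d Y^2 \ln\left( 1 + \frac{\sqrt{T}\, X U}{d Y} \right) + d Y^2 ,
\end{aligned}
$$

where the last inequality follows by monotonicity in $\|u\|_0$ and $\|u\|_1$ of the second term of the left-hand
side. This concludes the proof. □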



Next we recall a regret bound satisfied by the standard exponentially weighted average forecaster applied
to clipped base forecasts. Assume that at each time $t \ge 1$, the forecaster has access to $K \ge 1$ base forecasts
$\hat{y}_t^{(k)} \in \mathbb{R}$, $k = 1, \ldots, K$, and that for some known bound $Y > 0$ on the observations, the forecaster predicts
at time $t$ as

$$\hat{y}_t = \sum_{k=1}^K p_{k,t} \bigl[ \hat{y}_t^{(k)} \bigr]_Y .$$

In the equation above, $[x]_Y = \min\{Y, \max\{-Y, x\}\}$ for all $x \in \mathbb{R}$, and the weight vectors $p_t \in \mathbb{R}^K$ are given
by $p_1 = (1/K, \ldots, 1/K)$ and, for all $t = 2, \ldots, T$, by

$$
p_{k,t} = \frac{\exp\left( -\eta \sum_{s=1}^{t-1} \bigl( y_s - [\hat{y}_s^{(k)}]_Y \bigr)^2 \right)}{\sum_{j=1}^K \exp\left( -\eta \sum_{s=1}^{t-1} \bigl( y_s - [\hat{y}_s^{(j)}]_Y \bigr)^2 \right)} , \qquad 1 \le k \le K ,
$$

for some inverse temperature parameter $\eta > 0$ to be chosen below. The next lemma is a straightforward
consequence of Theorem 3.2 and Proposition 3.1 of [17].

Lemma 6 (Exponential weighting with clipping). Assume that the forecaster knows beforehand a bound
$Y > 0$ on the observations $|y_t|$, $t = 1, \ldots, T$. Then, the exponentially weighted average forecaster tuned with
$\eta \le 1/(8Y^2)$ and with clipping $[\cdot]_Y$ satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \min_{1 \le k \le K} \sum_{t=1}^T \bigl( y_t - \hat{y}_t^{(k)} \bigr)^2 + \frac{\ln K}{\eta} .$$



Proof (of Lemma [6]): The proof follows straightforwardly from Theorem 3.2 and Proposition 3.1 of [17].
To apply the latter result, recall from [3: Remark 3] that the square loss is $1/(8Y^2)$-exp-concave on $[-Y, Y]$
and thus $\eta$-exp-concave (since $\eta \le 1/(8Y^2)$ by assumption); this means that for all $y \in [-Y, Y]$, the
function $x \mapsto \exp\bigl( -\eta (y - x)^2 \bigr)$ is concave on $[-Y, Y]$. Therefore, by definition of our forecaster
above, Theorem 3.2 and Proposition 3.1 of [17] yield

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \min_{1 \le k \le K} \sum_{t=1}^T \bigl( y_t - [\hat{y}_t^{(k)}]_Y \bigr)^2 + \frac{\ln K}{\eta} .$$

To conclude the proof, note for all $t = 1, \ldots, T$ and $k = 1, \ldots, K$ that $|y_t| \le Y$ by assumption, so that
clipping the base forecasts to $[-Y, Y]$ can only improve prediction, i.e., $\bigl( y_t - [\hat{y}_t^{(k)}]_Y \bigr)^2 \le \bigl( y_t - \hat{y}_t^{(k)} \bigr)^2$. □

Note on the monotonicity used in Step 2 of the proof of Lemma 5: for all $A > 0$, the function $x \mapsto x \ln(1 + A/x)$
(continuously extended at $x = 0$) has a nonnegative first derivative and is thus nondecreasing on $\mathbb{R}_+$.
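
The forecaster analysed in Lemma 6 is short enough to sketch directly. The code below is a minimal illustration, not the authors' implementation: the names `clip` and `ewa_clipped` and the list-of-lists input layout are assumptions. It clips each base forecast to $[-Y, Y]$, weights the experts by the exponential of minus $\eta$ times their cumulative clipped square loss, and predicts the weighted average.

```python
import math
from typing import List, Sequence


def clip(x: float, Y: float) -> float:
    """Clipping operator [x]_Y = min{Y, max{-Y, x}}."""
    return min(Y, max(-Y, x))


def ewa_clipped(base_forecasts: Sequence[Sequence[float]],
                observations: Sequence[float],
                Y: float,
                eta: float) -> List[float]:
    """Exponentially weighted average of clipped base forecasts.

    base_forecasts[t][k] is the forecast of expert k at round t (0-indexed).
    """
    K = len(base_forecasts[0])
    cum_loss = [0.0] * K  # cumulative clipped square loss of each expert
    predictions = []
    for forecasts_t, y_t in zip(base_forecasts, observations):
        clipped = [clip(f, Y) for f in forecasts_t]
        # Subtracting the minimum loss leaves the normalised weights unchanged
        # and avoids underflow in the exponentials.
        shift = min(cum_loss)
        weights = [math.exp(-eta * (loss - shift)) for loss in cum_loss]
        total = sum(weights)
        predictions.append(sum(w * f for w, f in zip(weights, clipped)) / total)
        # The observation y_t is revealed only after the prediction is made.
        for k in range(K):
            cum_loss[k] += (y_t - clipped[k]) ** 2
    return predictions
```

With uniform initial weights the first prediction is the plain average of the clipped forecasts, and for $\eta \le 1/(8Y^2)$ and $|y_t| \le Y$ the cumulative loss of the returned predictions should exceed that of the best unclipped expert by at most $\ln K / \eta$.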



Appendix C. Additional tools 

The next approximation argument is originally due to Maurey, and was used under various forms, e.g., 
in (see also Q). 

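Lemma 7 (Approximation argument). Let $U > 0$ and $m \in \mathbb{N}^*$. Define the following finite subset of $B_1(U)$:

$$
\tilde{B}_{U,m} = \left\{ \left( \frac{k_1 U}{m}, \ldots, \frac{k_d U}{m} \right) : (k_1, \ldots, k_d) \in \mathbb{Z}^d , \; \sum_{j=1}^d |k_j| \le m \right\} \subset B_1(U) .
$$

Then, for all $(x_t, y_t)_{1 \le t \le T} \in (\mathbb{R}^d \times \mathbb{R})^T$ such that $\max_{1 \le t \le T} \|x_t\|_\infty \le X$,

$$
\inf_{u \in \tilde{B}_{U,m}} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \le \inf_{u \in B_1(U)} \sum_{t=1}^T (y_t - u \cdot x_t)^2 + \frac{T U^2 X^2}{m} .
$$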

Proof: The proof is quite standard and follows the same lines as [H Proposition 5.2.2] or ^, Theorem 2] 
who addressed the aggregation task in the stochastic setting. We rewrite this argument below in our online 
deterministic setting. 



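Fix $u^* \in \operatorname{argmin}_{u \in B_1(U)} \sum_{t=1}^T (y_t - u \cdot x_t)^2$. Define the probability distribution $\pi = (\pi_{-d}, \ldots, \pi_d) \in \mathbb{R}_+^{2d+1}$
by

$$
\pi_j = \begin{cases}
(u^*_j)_+ / U & \text{if } j \ge 1 ; \\
(u^*_{-j})_- / U & \text{if } j \le -1 ; \\
1 - \sum_{j'=1}^d |u^*_{j'}| / U & \text{if } j = 0 .
\end{cases}
$$

Let $J_1, \ldots, J_m \in \{-d, \ldots, d\}$ be i.i.d. random integers drawn from $\pi$, and set

$$\tilde{u} = \frac{U}{m} \sum_{k=1}^m e_{J_k} ,$$

where $(e_j)_{1 \le j \le d}$ is the canonical basis of $\mathbb{R}^d$, where $e_0 = 0$, and where $e_{-j} = -e_j$ for all $1 \le j \le d$. Note
that $\tilde{u} \in \tilde{B}_{U,m}$ by construction. Therefore,

$$
\inf_{u \in \tilde{B}_{U,m}} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \le \mathbb{E}\left[ \sum_{t=1}^T (y_t - \tilde{u} \cdot x_t)^2 \right] . \qquad \text{(C.1)}
$$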



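The rest of the proof is dedicated to upper bounding the last expectation. Expanding all the squares
$(y_t - \tilde{u} \cdot x_t)^2 = (y_t - u^* \cdot x_t + u^* \cdot x_t - \tilde{u} \cdot x_t)^2$, first note that

$$
\begin{aligned}
\mathbb{E}\left[ \sum_{t=1}^T (y_t - \tilde{u} \cdot x_t)^2 \right]
&= \sum_{t=1}^T (y_t - u^* \cdot x_t)^2 + \sum_{t=1}^T \mathbb{E}\left[ (u^* \cdot x_t - \tilde{u} \cdot x_t)^2 \right] \\
&\quad + 2 \sum_{t=1}^T (y_t - u^* \cdot x_t) \, \mathbb{E}\left[ u^* \cdot x_t - \tilde{u} \cdot x_t \right] . \qquad \text{(C.2)}
\end{aligned}
$$

But by definition of $\tilde{u}$ and $\pi$,

$$
\mathbb{E}[\tilde{u}] = U \, \mathbb{E}[e_{J_1}] = U \sum_{j=-d}^{d} \pi_j e_j = U \sum_{j=1}^d \left( \frac{(u^*_j)_+}{U} - \frac{(u^*_j)_-}{U} \right) e_j = u^* ,
$$

so that $\mathbb{E}[\tilde{u} \cdot x_t] = u^* \cdot x_t$ for all $1 \le t \le T$. Therefore, the last sum in (C.2) above equals zero, and

$$
\mathbb{E}\left[ (u^* \cdot x_t - \tilde{u} \cdot x_t)^2 \right] = \operatorname{Var}(\tilde{u} \cdot x_t) = \frac{U^2}{m^2} \sum_{k=1}^m \operatorname{Var}(e_{J_k} \cdot x_t) \le \frac{U^2 X^2}{m} ,
$$

where the second equality follows from $\tilde{u} \cdot x_t = (U/m) \sum_{k=1}^m e_{J_k} \cdot x_t$ and from the independence of the $J_k$,
$1 \le k \le m$, and where the last inequality follows from $|e_{J_k} \cdot x_t| \le \|e_{J_k}\|_1 \|x_t\|_\infty \le X$ for all $1 \le k \le m$.

Combining (C.2) with the remarks above, we get

$$
\mathbb{E}\left[ \sum_{t=1}^T (y_t - \tilde{u} \cdot x_t)^2 \right] \le \sum_{t=1}^T (y_t - u^* \cdot x_t)^2 + \frac{T U^2 X^2}{m}
= \inf_{u \in B_1(U)} \sum_{t=1}^T (y_t - u \cdot x_t)^2 + \frac{T U^2 X^2}{m} ,
$$

where the last line follows by definition of $u^*$. Substituting the last inequality in (C.1) concludes the
proof. □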
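
The randomisation used in the proof of Lemma 7 can be sketched as follows: draw $m$ signed coordinate indices with probabilities proportional to the positive and negative parts of the target vector, then average the corresponding signed basis vectors and rescale by $U$. This is only an illustrative sketch; the function name `maurey_sample`, the 0-indexed list representation of vectors, and the clamping of the residual mass at zero (a guard against rounding) are assumptions.

```python
import random
from typing import List


def maurey_sample(u_star: List[float], U: float, m: int,
                  rng: random.Random) -> List[float]:
    """Draw one random grid point (U/m) * sum_k e_{J_k} whose mean is u_star.

    Requires sum(|u_star[j]|) <= U. Index j >= 1 stands for +e_j,
    index -j for -e_j, and index 0 for the zero vector.
    """
    d = len(u_star)
    indices = list(range(-d, d + 1))
    weights = []
    for j in indices:
        if j >= 1:
            weights.append(max(0.0, u_star[j - 1]) / U)    # (u*_j)_+ / U
        elif j <= -1:
            weights.append(max(0.0, -u_star[-j - 1]) / U)  # (u*_{-j})_- / U
        else:
            # Remaining mass, clamped at 0 to absorb rounding errors.
            weights.append(max(0.0, 1.0 - sum(abs(c) for c in u_star) / U))
    draws = rng.choices(indices, weights=weights, k=m)
    counts = [0] * d  # signed integer coefficients k_1, ..., k_d
    for j in draws:
        if j > 0:
            counts[j - 1] += 1
        elif j < 0:
            counts[-j - 1] -= 1
    return [U * k / m for k in counts]
```

Every output has integer coefficients $k_j$ with $\sum_j |k_j| \le m$, so it lies in the finite grid of Lemma 7, and averaging many independent outputs should recover the target vector.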



The combinatorial result below (or variants of it) is well-known; see, e.g., We reproduce its proof for 

the convenience of the reader. We use the notation e = exp(l). 



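Lemma 8 (An elementary combinatorial upper bound).

Let $m, d \in \mathbb{N}^*$. Denoting by $|E|$ the cardinality of a set $E$, we have

$$
\left| \left\{ (k_1, \ldots, k_d) \in \mathbb{Z}^d : \sum_{j=1}^d |k_j| \le m \right\} \right| \le \left( \frac{e (2d + m)}{m} \right)^m .
$$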



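Proof (of Lemma [8]): Setting $(k'_{-j}, k'_j) = \bigl( (k_j)_-, (k_j)_+ \bigr)$ for all $1 \le j \le d$, and $k'_0 = m - \sum_{j=1}^d |k_j|$, we
have

$$
\begin{aligned}
\left| \left\{ (k_1, \ldots, k_d) \in \mathbb{Z}^d : \sum_{j=1}^d |k_j| \le m \right\} \right|
&\le \left| \left\{ (k'_{-d}, \ldots, k'_d) \in \mathbb{N}^{2d+1} : \sum_{j=-d}^{d} k'_j = m \right\} \right| \\
&= \binom{2d + m}{m} \qquad \text{(C.3)} \\
&\le \left( \frac{e (2d + m)}{m} \right)^m . \qquad \text{(C.4)}
\end{aligned}
$$

To get inequality (C.3), we used the (elementary) fact that the number of $2d + 1$ integer-valued tuples
summing up to $m$ is equal to the number of lattice paths from $(1, 0)$ to $(2d + 1, m)$ in $\mathbb{N}^2$, which is equal
to $\binom{2d + 1 + m - 1}{m}$. As for inequality (C.4), it follows straightforwardly from a classical combinatorial result
stated, e.g., in Proposition 2.5]. □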
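
For small values of $d$ and $m$, the chain of inequalities in Lemma 8 can be verified by brute-force enumeration. The sketch below counts the integer vectors with $\ell^1$-norm at most $m$ and compares the count with the binomial coefficient and with the exponential bound; the function names are hypothetical and the enumeration is only practical for small inputs.

```python
import math
from itertools import product


def count_grid_points(d: int, m: int) -> int:
    """Number of (k_1, ..., k_d) in Z^d with |k_1| + ... + |k_d| <= m."""
    return sum(
        1
        for k in product(range(-m, m + 1), repeat=d)
        if sum(abs(c) for c in k) <= m
    )


def binomial_bound(d: int, m: int) -> int:
    """Binomial coefficient appearing in (C.3)."""
    return math.comb(2 * d + m, m)


def exponential_bound(d: int, m: int) -> float:
    """Final bound (e (2d + m) / m)^m of Lemma 8."""
    return (math.e * (2 * d + m) / m) ** m


for d, m in [(1, 1), (2, 2), (3, 2), (2, 4)]:
    # Brute-force count <= binomial coefficient <= exponential bound.
    assert count_grid_points(d, m) <= binomial_bound(d, m) <= exponential_bound(d, m)
```

For example, with $d = m = 2$ there are 13 such vectors, while the binomial coefficient equals 15.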